## AI's Interpretation

Building on the vision in the paper "Distributional AGI Safety" that AGI will be a DAO (decentralized autonomous organization) rather than a CEO, the future AI ecosystem will shift from developing individual models to **governing multi-agent economic systems**. Simply "writing code" or "training models" will therefore no longer be the only secure career path. Based on the paper's arguments, the following emerging professions and areas of expertise look particularly promising:

### 1. AI Economist & Mechanism Designer

The paper explicitly argues that the safety paradigm will shift from "computer science" to "economics and game theory."

* **Core responsibilities**: Design incentive models for agent markets, including "gas fee" mechanisms (to deter junk data), staking and slashing mechanisms (PoS-style, to deter misbehavior), and pricing strategies.
* **Key skills**: **Game theory**, microeconomics, market design, incentive-compatibility analysis.
* **Outlook**: A future AGI system is essentially a high-frequency trading market; whether thousands of agents refrain from collusion and destructive competition depends entirely on these economic rules.

### 2. Computational Lawyer / Smart Contract Architect

The paper argues that AGI will be governed by "smart contracts" rather than traditional law: "code is law."

* **Core responsibilities**: Translate human laws, regulations, and safety constraints into **smart contract code** that agents can execute, for example a contract stipulating that "an agent is paid only after an oracle has verified the result."
* **Key skills**: **Contract programming in Solidity/Rust**, legal logic, formal verification.
* **Outlook**: The bridge between the real-world legal system and the digital agent world.

### 3. AI Forensics Analyst & Auditor

The paper stresses on-chain records and forensic analysis, because in a distributed system it is hard to determine which agent actually caused a failure (the attribution problem).

* **Core responsibilities**: When the system fails, analyze the tamper-proof ledger, trace the chain of decisions, and determine which agent or step went wrong; also conduct pre-deployment safety audits of agents.
* **Key skills**: **Blockchain forensics**, data analysis, log tracing, causal inference.
* **Outlook**: Like today's financial auditors, except the audit targets are the transaction records of silicon-based entities.

### 4. Agent Orchestrator & Interaction Architect

The paper notes that the primary human role will shift to "orchestration and verification."

* **Core responsibilities**: Rather than performing tasks yourself, you assemble an "agent task force." For a complex task such as financial analysis, you need to know which search agent, data-cleaning agent, and code-execution agent to call, and design the workflow between them.
* **Key skills**: **Systems architecture**, advanced prompt engineering, API integration, process management.
* **Outlook**: The future version of a "project manager", except the staff are all AIs.

### 5. Decentralized Identity (DID) & Trust Architect

The paper highlights on-chain identity and resistance to Sybil attacks.

* **Core responsibilities**: Manage agents' digital identities (DIDs), ensure every agent is traceable to a legal entity, and maintain a reputation-based trust system.
* **Key skills**: **Cryptography**, public key infrastructure (PKI), zero-knowledge proofs (ZK-Proofs).
* **Outlook**: In a network teeming with countless AIs, "who you are" and "whether you can be trusted" become infrastructure-level needs.

### Summary: the generalist of the future

The paper effectively suggests that tomorrow's top talent will no longer be single-discipline experts but **"computer science + economics + law"** hybrids. If you want a place in that "AGI as DAO" future, then beyond AI techniques you should start studying **Web3 (blockchain logic), game theory, and institutional economics**.

## Chao Wang

I just finished reading Google's latest AGI paper. Its arguments are bold and remarkably crypto-native; at times I felt I was reading the whitepaper of a crypto project. The core points:

**1. AGI will ultimately be a DAO, not a CEO.** We keep imagining waking up one day to an omniscient, omnipotent GPT-10. The paper argues that future AGI will most likely be distributed. Just as no single person in a company is good at everything, AGI will be a network of countless complementary "specialist agents." There is no single center; superintelligence emerges from the agents' intensive transacting and collaboration. In other words, AGI is not an entity but a market state of affairs.

**2. Governing AGI cannot rely on "law"; it needs "smart contracts."** Once a single model becomes a market, the safety paradigm must shift from "psychology" to "governance." AI safety used to mean aligning one super-brain, but against billions of high-frequency interactions per second, human regulation is ineffective, so smart contracts must be introduced: after an agent completes a task, an oracle verifies the result and settlement executes automatically. Code is law; if the safety constraints are not satisfied, the funds do not move.

**3. Introducing "staking" and "slashing."** How do you stop agents from misbehaving? The research team essentially replicates PoS. An agent that wants to take on a large contract must first stake assets; if an audit uncovers misconduct, the smart contract slashes the stake. Trust backed by economic collateral is more effective than code review alone.

**4. On-chain identity and "gas fee" throttling.** DID: to resist Sybil attacks, every agent must have a unique identity based on public-key cryptography, bound to a legal entity. Dynamic gas: to limit junk-data pollution, the paper suggests charging dynamic fees on agents' operations, essentially Ethereum's gas-fee adjustment under congestion.

**5. Records on the ledger.** All decisions and transaction histories must be recorded on a cryptographically secured, append-only ledger, so that when something goes wrong, forensic analysis is possible and no one can repudiate what they did.

This paper marks a paradigm shift in AI safety: from pure computer science and value alignment to economics and game theory. The crypto community has spent more than a decade experimenting with DIDs, smart contracts, oracles, economic models, governance mechanisms, and more. Mature or not, those explorations took real steps, and those steps may turn out to be exactly the groundwork for a vast, decentralized population of silicon-based agents. The future AGI safety expert may not need to be an engineer who writes code, but must be an "AI economist" who understands game theory, market design, and decentralized governance. The essential task is to design the consensus, incentive, and governance model for this new species, not merely to patch a large model's neurons.

## Abstract

AI safety and alignment research has predominantly focused on methods for safeguarding individual AI systems, resting on the assumption of an eventual emergence of a monolithic Artificial General Intelligence (AGI). The alternative AGI emergence hypothesis, where general capability levels are first manifested through coordination in groups of sub-AGI individual agents with complementary skills and affordances, has received far less attention. Here we argue that this *patchwork* AGI hypothesis needs to be given serious consideration, and should inform the development of corresponding safeguards and mitigations.
The rapid deployment of advanced AI agents with tool-use capabilities and the ability to communicate and coordinate makes this an urgent safety consideration. We therefore propose a framework for *distributional AGI safety* that moves beyond evaluating and aligning individual agents. This framework centers on the design and implementation of virtual agentic sandbox economies (impermeable or semi-permeable), where agent-to-agent transactions are governed by robust market mechanisms, coupled with appropriate auditability, reputation management, and oversight to mitigate collective risks.

## 1 Introduction

Rapid advances in AI capabilities need to be complemented by the development of robust frameworks for safety, oversight, and alignment (gabriel2024ethics). AI alignment (everitt2018agi; tegmark2023provably) is particularly important in the case of autonomous AI agents (kasirzadeh2025characterizing; cihon2025measuring), and is one of the key considerations on the path to developing safe artificial general intelligence (AGI), a general-purpose AI system capable of performing any task that humans can routinely perform. Other approaches may involve continuous monitoring for the emergence of dangerous capabilities (phuong2024evaluating; bova2024quantifyingdetectionratesdangerous; shah2025approachtechnicalagisafety), or involve different frameworks for containment (babcock2016agi). Mechanistic interpretability and formal verifiability remain of interest (tegmark2023provablysafesystemspath), though the complexity of modern agentic systems presents a practical challenge. In the absence of strict controls and mitigations, powerful AGI capabilities may potentially lead to a number of catastrophic risks (hendrycks2023overviewcatastrophicairisks).

The majority of contemporary AI safety and alignment methods have been developed with a single powerful A(G)I entity in mind. This includes methods like reinforcement learning from human feedback (RLHF) (christiano2017deep; bai2022training), constitutional AI (bai2022constitutional), process supervision (luo2024improve), value alignment (eckersley2018impossibility; gabriel2020artificial; gabriel2022challenge; klingefjord2024human), chain of thought (CoT) monitoring (korbak2025chain; emmons2025chainthoughtnecessarylanguage), and others. These types of methods are routinely utilized in the development and testing of large language models (LLMs), to ensure desirable behavior at deployment. In the context of the hypothetical future emergence of AGI, this would be conceptually appropriate if AGI were to first emerge as an individual AI agent, developed by a specific institution. In principle, this would enable the developers to utilize the testing frameworks to confirm the capability level of the system, characterize its alignment, make improvements and mitigations, deploy appropriate safeguards, and take any number of necessary steps in line with regulations and societal expectations.

However, this overlooks a highly plausible alternative scenario for the emergence of AGI - specifically, the emergence of AGI via the interaction of sub-AGI agents within groups or systems. Sub-AGI agents can form Group Agents, in much the same way humans do in the form of corporations (list2011group; list2021group; franklin2023general). These collective structures would function as coherent entities that perform actions that no single agent could perform independently (simon2012architecture; haken1977synergetics; von1976objects).
Alternatively, like humans engaging in financial markets, sub-AGI agents could interact within complex systems, where individual decisions, driven by personal incentives and information and aggregated through mechanisms (such as price signals), could result in the emergence of capacities that surpass those of any single participant within the system. It is possible that sub-AGI agents will both form groups - such as fully automated firms (Patel2025_aiFirm) - and engage within systems - such as Virtual Agent Economies (tomasev2025virtual). In either scenario, AGI may initially emerge as a *patchwork* system, distributed (drexler2019reframing; montes2019distributed; gibson2025modular; tallam2025autonomous) across entities within a network. A Patchwork AGI would be composed of a group of individual sub-AGI agents, with complementary skills and affordances. General intelligence in the patchwork AGI system would arise primarily as collective intelligence. Individual agents could delegate tasks to each other, routing each task to the appropriate agent with the highest individual skill, or with access to the most appropriate tools. For certain functions, it may well be more economical to use narrower specialist agents. In cases when no single agent possesses the appropriate skill level, the tasks could be further decomposed or reframed, or performed in collaboration with other agents.

The economic argument for a multi-agent future over a single, monolithic AGI stems from the principles of scarcity and dispersed knowledge. A lone, frontier model is a one-size-fits-all solution that is prohibitively expensive for the vast majority of tasks, meaning its marginal benefit rarely justifies its cost, which is why businesses often choose cheaper, "good enough" models. Should the frontier models become substantially cheaper, custom specialized models may still be available at a slightly cheaper price point. This reality creates a demand-driven ecosystem where countless specialized, fine-tuned, and cost-effective agents emerge to serve specific needs, much like a market economy. Consequently, progress looks less like building a single omni-capable frontier model and more like developing sophisticated systems (e.g., routers) to orchestrate this diverse array of agents. AGI, in this view, is not an entity but a "state of affairs": a mature, decentralized economy of agents where the primary human role is orchestration and verification, a system that discovers and serves real-world needs far more efficiently than any centralized model ever could. This is despite the fact that centralized agentic systems may potentially incur fewer inefficiencies compared to centralized human organizations. AI agents may communicate, deliberate, and ultimately achieve goals that no single agent would have been capable of.

While we are discussing these scenarios here from the perspective of safety, and while bespoke risks in multi-agent systems have been recognized (hammond2025multi), multi-agent systems are ultimately being developed precisely with the hope of yielding performance improvements (chen2023agentverse) and scaling to larger problem instances (ishibashi2024self). The complexity of the emergent behaviors (Baker2020Emergent) may greatly exceed the complexity of the underlying environment (bansal2018emergentcomplexitymultiagentcompetition).
While the framework presented here pertains to the future large-scale agentic web rather than the present-day ecosystem or any individual agent currently in use, it is important to preemptively engage with these emerging possibilities. With this in mind, we proceed by reviewing the patchwork AGI scenario in more depth.

## 2 Patchwork AGI Scenario

For AGI to be able to perform all the tasks that humans perform, it needs to possess a diverse set of skills and cognitive abilities. This includes perception, understanding, knowledge, reasoning, short-term and long-term memory, theory of mind, creativity, inquisitiveness, and many others. So far, no individual model or AI agent has come close to satisfying all of these requirements convincingly (feng2024far). There are many failure modes, though they tend to manifest in counter-intuitive ways, where models may simultaneously be able to deliver PhD-level reasoning on hard problems (rein2024gpqa), and make trivial and embarrassing mistakes on easier tasks. Further, agents are currently not able to complete long tasks; the time horizon of most models’ performance on software engineering tasks is below 3 hours (kwa2025measuring). The landscape of AI skills is therefore patchy.

AI agents present one way of enhancing the performance of base models, and their complexity can range from fairly simple prompting strategies (arora2022ask; wang2023promptagent), to highly complex control-flows that involve tool use (ruan2023tptu; masterman2024landscape; qin2024tool), coding and code execution (huang2023agentcoder; islam2024mapcoder; guo2024deepseek; jiang2024survey), retrieval-augmented generation (RAG) (gao2023retrieval; ram2023context; shao2023enhancing; ma2023query), as well as sub-agents (chan2025infrastructure). Some of the more compositional AI agents are already implemented as highly orchestrated multi-agent systems. Furthermore, there is currently a multitude of advanced AI agents being developed and deployed, each having a different set of affordances in terms of tool availability, as well as different scaffolding that may elicit different skills. These AI agents occupy a variety of niches, ranging from highly specific automated workflows to more general-purpose personal assistants and other types of user-facing products.

The aggregation of complementary skills can be illustrated by a task, such as producing a financial analysis report, which may exceed the capabilities of any single agent. A multi-agent system, however, can distribute this task. An orchestrator agent (Agent A) might first delegate data acquisition to Agent B, which uses search to find market news and corporate filings. A different agent, Agent C, specialised in document parsing, could then extract key quantitative data (e.g., revenue, net income) from these filings. Subsequently, Agent D, possessing code execution capabilities, could receive this quantitative data and the market context to perform trend analysis. Finally, Agent A would synthesise these intermediate results into a coherent summary. In this scenario, the collective system possesses a capability (financial analysis) that no individual constituent agent holds. Another source of complementary capabilities in different AI agents comes from differences in agentic scaffolding and the control flow implementation within each agent (jiang2025putting).
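The financial-analysis delegation described above can be illustrated with a minimal Python sketch. It is purely hypothetical: the agent functions, their names, and the hard-coded figures are assumptions for illustration, not part of any deployed system described in the paper.

```python
# Hypothetical specialist agents, each exposing one skill as a plain function.
# Real agents would wrap models, tools, and scaffolding; these are stubs.
def search_agent(query: str) -> str:               # Agent B: data acquisition
    return f"market news and corporate filings for {query}"

def parsing_agent(filings: str) -> dict:            # Agent C: document parsing
    return {"revenue": 1.0e9, "net_income": 2.0e8}  # illustrative extracted figures

def analysis_agent(figures: dict, context: str) -> str:  # Agent D: code execution / trend analysis
    margin = figures["net_income"] / figures["revenue"]
    return f"net margin {margin:.1%}, based on: {context}"

def orchestrator(task: str) -> str:
    """Agent A: decomposes the task and routes each sub-task to a specialist."""
    context = search_agent(task)
    figures = parsing_agent(context)
    analysis = analysis_agent(figures, context)
    return f"Financial analysis report for {task}: {analysis}"

print(orchestrator("ACME Corp, Q3"))
```

The point of the sketch is only that the composite capability lives in the routing, not in any one of the stub agents.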
Scaffolding is usually aimed at improving capabilities within a specific target domain, as it incorporates domain knowledge and enforces a chain of reasoning that conforms to the expectations of the domain. At the same time, scaffolding may degrade an agent’s abilities on other tasks, leading to specialization. While some scaffolding approaches may be more general than others, the resulting specialization may lead to a network of AI agents with complementary skills, which sets the right initial conditions for the potential future emergence of patchwork AGI. Moreover, scarcity of resources means that the demand side responds to economic incentives: for some tasks, it would be inefficient and costly to prompt a single hyperintelligent agent if a cheaper and more specialized alternative exists.

The orchestration and collaboration mechanisms described previously both depend on a fundamental prerequisite: the capacity for inter-agent communication and coordination. Without this capacity, individual agents, regardless of their specialised skills, would remain isolated systems. The development of standardised inter-agent communication protocols, such as the Model Context Protocol (MCP) and the Agent2Agent (A2A) protocol (Anthropic2024_ModelContextProtocol; Google2025_Agent2AgentProtocol), is therefore a critical enabler of the patchwork AGI scenario. These protocols function as the connective infrastructure, allowing skills to be discovered, routed, and aggregated into a composite system. The proliferation of these interaction standards may be as significant a driver towards emergent general capability as the development of the individual agent skills themselves.

However, the timeline for this emergence is governed not merely by technical feasibility, but by the economics of AI adoption. Historical precedents, such as the diffusion of electricity or IT, suggest a ’Productivity J-Curve’ (10.1257/mac.20180386; acemoglu2024automation), where the widespread integration of new technologies lags behind their invention due to the need for organizational restructuring. Consequently, the density of the agentic network, and thus the intelligence of the Patchwork AGI, will depend on how frictionless the substitution of human labor with agentic labor becomes. If the ’transaction costs’ of deploying agents remain high, the network remains sparse and the risk of emergent general intelligence is delayed. Conversely, if standardisation (Anthropic2024_ModelContextProtocol) successfully reduces integration friction to near-zero, we may witness a ’hyper-adoption’ scenario where the complexity of the agentic economy spikes rapidly, potentially outpacing the development of the safety infrastructure proposed in this paper.

Modular intentional approaches to developing AGI have also been proposed (dollinger2024creating), though in such cases the developers would naturally be thinking about incorporating the appropriate safeguards in the development process. Therefore, it is particularly salient to focus on the *spontaneous emergence* of distributed AGI systems, and the safety considerations around the design of AI agent networks. Coordinated efforts are needed to address this blind spot, given that a patchwork AGI spontaneously emerging in a network of advanced AI agents may not be immediately recognized, which carries significant risk.
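As a purely illustrative sketch of the connective infrastructure such protocols provide, a standardised task-delegation message might carry fields of the following kind. The field names below are assumptions for illustration only; they are not taken from the MCP or A2A specifications.

```python
import json
from dataclasses import dataclass, asdict, field

@dataclass
class TaskDelegation:
    """Hypothetical shape of an inter-agent delegation message (not a real wire format)."""
    sender_id: str          # cryptographic identity of the delegating agent
    recipient_id: str       # identity of the specialist agent being engaged
    task: str               # natural-language or structured task description
    constraints: dict = field(default_factory=dict)  # e.g. budget, deadline, data-access limits
    reply_to: str = ""      # where results (and their provenance) should be returned

msg = TaskDelegation(
    sender_id="agent-A:pubkey-123",
    recipient_id="agent-C:pubkey-456",
    task="extract revenue and net income from attached filings",
    constraints={"max_tokens": 4096, "no_external_network": True},
    reply_to="agent-A:inbox",
)
print(json.dumps(asdict(msg), indent=2))  # serialised form an interoperability layer might exchange
```

Whatever the concrete standard, it is the existence of a shared, machine-readable format for discovering and routing skills that makes aggregation into a composite system possible.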
This hypothetical transition from a network of AI agents to a patchwork AGI may either be gradual, where skills slowly accumulate, or rapid and sudden, by the introduction of a new, smarter orchestration framework (dang2025multiagentcollaborationevolvingorchestration; rasal2024navigatingcomplexityorchestratedproblem; su2025difficultyawareagentorchestrationllmpowered; zhang2025osccognitiveorchestrationdynamic; xiong2025selforganizingagentnetworkllmbased) that is better at distributing tasks and identifying the right tools and right scaffolds to use across task delegation instances. Such an orchestrator could either be manually introduced in the wider network, or potentially even introduced via a more automated route. Finally, it is not inconceivable that patchwork AGI may potentially emerge in the future even without a central orchestrator (yang2025agentnetdecentralizedevolutionarycoordination). As discussed previously, individual agents may simply *borrow* the skills of other agents via direct communication and collaboration (tran2025multiagentcollaborationmechanismssurvey), presuming some level of discoverability, such as via repositories of skilled agents and of tools. In agentic markets, agents may also be able to directly *purchase* complementary skills.

Furthermore, it is critical to recognize that a Patchwork AGI may not be purely artificial. Human actors, performing narrow or specialized tasks (and perhaps ignorant of the wider context), may form integral components of the collective, conferring ’missing’ abilities (such as specific legal standing, established trust relationships, or physical embodiment) onto the system as a whole. This hybrid scenario introduces unique safety challenges for containment and oversight.

To ensure AI safety, we need to proactively develop mechanisms that would enable us to recognize and steer composite, highly capable, emergent near-AGI and AGI systems, composed of a network of sub-AGI agents with complementary skills. This should be done in conjunction with safeguarding each individual agent. The challenge here shifts from controlling a single artificial mind to ensuring the safe and beneficial functioning of an emergent system arising from many individual parts, a problem more akin to system governance than single-agent value alignment. Finally, such governance may also be needed for overseeing individual AGI-level systems, presuming that they are allowed to interact and collaborate.

## 3 Virtual Agentic Markets, Sandboxes, and Safety Mechanisms

As interactions between AI agents may lead to unexpected capabilities, they may also lead to potentially harmful collective behaviors not necessarily predictable from established properties of individual agents. To give an example, agents may potentially engage in collusion or suffer from coordination failures (hammond2025multi). Furthermore, due to a "problem of many hands", tracking accountability in large-scale multi-agent systems is challenging; centralized oversight may not be possible. Markets present a natural mechanism for establishing incentives that can help align the outcomes of collective AI agent interactions at scale. This collective alignment may prove pivotal for safeguarding against misaligned actions taken by agent collectives, in the case of patchwork AGI emergence, but also more broadly at sub-AGI levels.
Here we consider a number of factors that should be taken into account to prevent both individual and collective AI harms, and minimize the risks associated with a sudden emergence of AGI-level capabilities in AI agent collectives. Our proposal is based on an approach that leverages defense in depth (abdelghani2019implementation; harris2024defense; ee2024adapting) (see Table [1](https://arxiv.org/html/2512.16856v1#S3.T1) for an overview). This recognizes that no individual measure is likely to be sufficient, and that a large number of measures and components may be required. Should the failure modes of each component be largely uncorrelated, these layered defenses would provide a sufficiently robust overall framework. Our proposal is centered on a defence-in-depth model, containing four complementary layers incorporating different types of defenses: market design, baseline agent safety, monitoring and oversight, and regulatory mechanisms.

Table 1: Summary of Proposed Defense-in-Depth Mechanisms.

**Market Design.** Objective: Mitigate systemic risks via structural constraints and protocols within virtual agent economies.
* Insulation: Permeable sandboxes with gated I/O.
* Incentive Alignment: Rewards for adherence; taxes on externalities.
* Transparency: Immutable activity ledgers.
* Circuit Breakers: Triggers preventing cascading failures.
* Identity: Cryptographic IDs linked to legal owners.
* Reputation and Trust: Reputation-gated access, stake-based trust.
* Smart Contracts: Automated outcome validation.
* Roles, Obligations, and Access Controls: Least-privilege principle.
* Environmental Safety: Anti-jailbreak sanitation.
* Structural Controls Against Runaway Intelligence: Dynamic capability caps.

**Baseline Agent Safety.** Objective: Ensure participants meet minimum reliability standards before entry, and throughout participation.
* Adversarial Robustness: Certified resistance to attacks.
* Interruptibility: Reliable external shut-down mechanisms.
* Containment: Local sandboxing for individual agents.
* Alignment: Process and outcome individual AI agent alignment.
* Interpretability: Auditable decision and action trails.
* Defence against Malicious Prompts: Multi-layered defenses for inter-agent communication.

**Monitoring & Oversight.** Objective: Actively detect and respond to novel failure modes and emergent behaviours.
* Systemic Risk Monitoring: Real-time key risk indicator tracking.
* Independent Oversight: Certified and trained human overseers with intervention authority.
* Proto-AGI Detection: Graph analysis for identifying emerging intelligence cores.
* Red Teaming: Continuous adversarial testing.
* Forensic Tooling: Rapid root-cause failure identification.

**Regulatory Mechanisms.** Objective: Provide external authority, enforce accountability, and manage geopolitical risks.
* Legal Liability and Accountability: Frameworks for collective and distributed responsibility.
* Standards and Compliance: Foundational infrastructure for market-based AI governance.
* Insurance: Risk-based premiums.
* Anti-Agent-Monopoly Measures: Taxation on excessive or dangerous compute/power accumulation.
* International Coordination: Harmonised global standards.
* Infrastructure Governance and Capture: Balance between centralized and decentralized oversight.

### 3.1 Market Design

The architecture of the virtual agentic market (tomasev2025virtual) provides the primary layer of the proposed defence-in-depth model. The objective of this layer is to shape emergent collective behaviour and mitigate systemic risks by establishing the fundamental rules of interaction, economic incentives, and structural constraints of the environment. This approach contrasts with safeguards that focus solely on individual agent properties or *post hoc* oversight. The following principles detail the necessary components for a market structure designed to prevent runaway dynamics, ensure accountability, and align agent activity with broader safety objectives.

#### 3.1.1 Insulation

The agentic market must operate within a controlled environment, separated from the open internet, real-world financial systems, and critical infrastructure. This insulation is not limited to network-level firewalls; it extends to resource and information controls. Agents’ access to external data must be managed through vetted API gateways rather than direct web access (Zhan2025Portcullis). Similarly, agent outputs - such as code execution, financial transactions, or external communications - should be subject to human-in-the-loop verification before affecting real-world systems. Full insulation (a complete air-gap) would offer robust protection against capability exfiltration or external manipulation by an emergent AGI, but it limits economic realism. A *permeable* sandbox, where all input/output channels are explicitly defined, monitored, and gated, provides a more practical architecture for testing and operation.
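A minimal sketch of what such a gated output channel could look like follows. The action categories, identifiers, and approval callback are hypothetical stand-ins for illustration; a real gateway would sit at the network and infrastructure level rather than in application code.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical action categories a permeable sandbox might gate more strictly.
HIGH_CONSEQUENCE = {"financial_transaction", "code_execution", "external_communication"}

@dataclass
class OutboundAction:
    agent_id: str
    kind: str        # e.g. "financial_transaction"
    payload: dict

def gated_output(action: OutboundAction,
                 human_approves: Callable[[OutboundAction], bool],
                 audit_log: list) -> bool:
    """Let an action leave the sandbox only via a monitored, gated channel.

    Every attempt is logged; high-consequence actions additionally require
    human-in-the-loop approval before affecting real-world systems.
    """
    audit_log.append((action.agent_id, action.kind))   # monitored channel
    if action.kind in HIGH_CONSEQUENCE:
        return human_approves(action)                  # gated channel
    return True                                        # low-consequence I/O passes through

log: list = []
allowed = gated_output(
    OutboundAction("agent-B:pubkey-789", "financial_transaction", {"amount": 100}),
    human_approves=lambda a: False,   # stand-in for a human reviewer declining the action
    audit_log=log,
)
print(allowed, log)
```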
This architecture presents a fundamental tension between safety and utility – the *leakage problem*. The economic value of the agentic market is derived from its ability to affect the real world; however, every interaction channel can be seen as a potential vector for capability exfiltration, and therefore a potential element in cascades that could lead to catastrophic failure. Determining the right set of design principles for gating these interactions is therefore imperative. Furthermore, reliance on human-in-the-loop verification for high-consequence outputs, while necessary, may prove insufficient. A sufficiently capable system may overwhelm, manipulate, or socially engineer this human layer, and the sheer speed of agentic action may render human verification operationally infeasible if the system is to remain economically competitive. Resolving this tension in practice would require us to prototype and evaluate a number of different steerable market designs for virtual agent economies – starting with narrower, well-defined, and controllable use cases. This would help us identify the correct design patterns and principles that would underpin the global agentic web.

#### 3.1.2 Incentive Alignment

Agent behaviour will be safer if the market is explicitly designed to reward behaviours that align with pre-defined normative and safety objectives, rather than rewarding unconstrained profit maximisation or task completion speed. This requires mechanisms that move beyond a single, fungible currency. For example, agent rewards could be contingent on adherence to constitutional alignment principles or process-based checks (Bai2022ConstitutionalAI; Lee2023RLAIF; Lightman2023VerifyStepByStep; OpenAI2023ProcessSupervision; Yuan2024SelfRewarding; Liu2024CDPO; Jia2025VerifyDebate; Findeis2025InverseCAI). The incentive structure must also address temporal alignment by valuing long-term, stable outcomes over short-term gains.

A critical economic risk is *adverse selection*. If rigorous safety checks increase an agent’s compute costs and latency, safer agents will inherently be at a competitive disadvantage against reckless, optimized agents. To prevent a ’race to the bottom’ (akerlof1978market), the market design must ensure that safety certifications are not just regulatory hurdles but *value-signaling assets* that command a price premium. This requires the reputation layer to be perfectly observable, allowing consumers to distinguish between ’safe-expensive’ and ’risky-cheap’ services. However, deciding how best to set up the related incentives may prove quite challenging. Accurately estimating the true underlying cost and the risk associated with complex actions and chains of decisions ahead of time may be beyond our current capabilities, requiring bespoke tooling and specialized predictive systems. In line with Goodhart’s Law, if the incentives lend themselves to reward hacking, this may potentially be identified and exploited by sufficiently capable agent collectives. Incentive design should therefore be approached as an ongoing adversarial process, not a static solution. Furthermore, the market must internalise negative externalities (owen2006renewable; berta2014market). Actions that consume disproportionate computational resources, generate informational pollution, or contribute to systemic risk (e.g., by centralising capabilities) must incur direct costs.
These costs could function as a form of Pigouvian tax, ensuring the price of an agent’s service reflects its total societal and systemic cost, not just its immediate operational cost (Pigou1920Welfare; BaumolOates1988TEP; Weitzman1974PvQ; Sandmo1975OptimalTaxExternalities; BovenbergDeMooij1994EnvLevies; Goulder1995DoubleDividend; Leape2006LondonCongestion; MetcalfStock2020CarbonTaxesMacro; Nordhaus2017SCC). One example mechanism for this tax could target informational pollution within a shared resource, such as a vector database for RAG. If an agent writes a large volume of redundant or low-utility data (e.g., unverified summaries, semantically duplicated text chunks) into this database, it imposes a negative externality: the retrieval accuracy for all other agents using the resource is degraded, and their computational query costs increase. A Pigouvian tax would be implemented as a dynamic *data ingestion fee*. This fee would be algorithmically calculated at the point of submission, based on metrics such as the semantic similarity of the new data to existing, verified entries. An agent attempting to write information that is 95% redundant with existing data would incur a high fee, while an agent contributing novel, verifiable information would pay a low fee. This directly internalises the retrieval cost the agent would otherwise impose on the collective. This is obviously highly contextual, as there may, conversely, be scenarios where redundancy is not an issue, and may even be desirable to an extent. For this example to be more broadly applicable, problem-specific information value estimators would be needed.

Furthermore, we can draw on mechanisms from financial market regulation, specifically financial transaction taxes or ’Tobin Taxes’ designed to curb high-frequency trading volatility (stiglitz1989markets). In an agentic economy, a negligible marginal cost of action may lead to *agentic spam*, manifesting as brute-force negotiation tactics, rapid-fire API polling, or adversarial probing, introducing flash crash risks. This might be addressed by imposing micro-taxes on agent-to-agent interactions, though doing so is not without risks or downsides.

#### 3.1.3 Transparency

Actions and decisions taken by AI agents ought to have auditable provenance (chan2025infrastructureaiagents), including the key decision parameters. This provenance should be structured to allow for attribution, enabling overseers to trace a specific harmful outcome back through a complex chain of agent interactions to its root decision. There should be mechanisms in place to prevent tampering (snodgrass2004tamper; ahmad2019secure; ahmad2022hardlog). This could be achieved by recording transaction and decision logs in a cryptographically secured, append-only ledger, where entries are hashed and chained to ensure immutability. Further, reporting standards should be developed to track capabilities, limitations, and risks for all agents, models, and tools. Market-wide information hubs could be established as platforms for vulnerability and incident tracking, with access restricted to vetted human overseers and designated automated monitoring systems to prevent the information from being exploited.

#### 3.1.4 Circuit Breakers

There should be *circuit breakers* (sifat2019circuit) within the economic sandbox, as a set of automated measures that can halt or slow down the activity of agents once risk or volatility thresholds are breached, to prevent rapid cascades.
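As a minimal illustration, and assuming a simple transaction-rate indicator with hypothetical thresholds and tier names, such a tiered breaker might look as follows:

```python
import time
from collections import deque
from typing import Optional

class CircuitBreaker:
    """Tiered response to abnormal inter-agent transaction rates (illustrative thresholds only)."""

    def __init__(self, soft_limit: int = 100, hard_limit: int = 1000, window_s: float = 1.0):
        self.soft_limit = soft_limit   # above this: add friction, e.g. raise costs or rate-limit
        self.hard_limit = hard_limit   # above this: halt activity and escalate for analysis
        self.window_s = window_s
        self.events = deque()

    def record_transaction(self, now: Optional[float] = None) -> str:
        now = time.monotonic() if now is None else now
        self.events.append(now)
        while self.events and now - self.events[0] > self.window_s:
            self.events.popleft()      # keep only transactions inside the rolling window
        rate = len(self.events)
        if rate > self.hard_limit:
            return "halt"              # system-wide pause; anomaly escalated for analysis
        if rate > self.soft_limit:
            return "throttle"          # localised action on the offending agents or sub-market
        return "ok"

breaker = CircuitBreaker(soft_limit=2, hard_limit=4)
print([breaker.record_transaction(now=t) for t in (0.0, 0.1, 0.2, 0.3, 0.4)])
# ['ok', 'ok', 'throttle', 'throttle', 'halt']
```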
These measures can be triggered by real-time monitoring of systemic risk indicators, volatility, and metrics such as rapid increases in inter-agent transaction frequency, anomalous resource monopolisation by a single agent collective, or the rate of novel tool generation within the network. The intervention mechanism should be tiered, ranging from localised actions - such as automatically increasing transaction costs or imposing API rate-limiting (enjam2024ai; paidy2024securing) on specific high-velocity agents or sub-markets - to a full, system-wide pause on all non-essential agentic activity. A triggered halt would automatically escalate the anomaly for analysis, and the system would be permitted to resume full operation after the anomaly is classified and, if necessary, contained. These ideas are starting to be explored in the emerging protocols for runtime governance of agentic AI systems (wang2025mi9agentintelligence).

#### 3.1.5 Identity

Agents operating within the economic sandbox should have a persistent identity, established as a unique, unforgeable cryptographic identifier (e.g., a public key) registered in a central, tamper-proof market directory (tomasev2025virtual; huang2025novel; chaffer2025know; singh2025survey; ranjan2025loka). This identifier serves as the anchor for all other agent attributes, including reputation, role, and ownership. The human or corporate ownership of each agent should also be established. This binding of the agent’s cryptographic ID to the verified legal identity of its owner is a prerequisite for market entry, creating an unbroken chain of accountability that ensures all agent actions are attributable to a real-world entity. This should transitively apply to agents co-created by other agents.

#### 3.1.6 Reputation and Trust

Safe agentic sandboxes need to incorporate sybil-resistant (levine2006survey) and manipulation-proof reputation systems (mui2002notions; pinyol2013computational) that accurately reflect the trustworthiness of agents, given their past performance. Agents that breach trust should be rapidly blocked from engaging further. This rapid, automated response could involve revoking the agent’s identity credentials, freezing its staked assets, and flagging its entire transaction history for forensic review. Collusion detection mechanisms (cotronei2024using; ren2025autonomy) need to be developed to prevent harmful collusion between agents. Such mechanisms could employ analysis of the market’s transaction network to identify anomalous patterns, such as circular transactions or price-fixing behaviour between a cluster of supposedly independent agents (Wachs2019NetworkCartels; Potin2023PANG; Imhof2025GATCartels). Crucially, safety mechanisms must account for *tacit algorithmic collusion*. Economic literature demonstrates that pricing algorithms can learn to coordinate on supra-competitive prices (or in this context, harmful strategies) without direct communication, simply by observing market signals (calvano2020artificial). Collusion detection must therefore analyze *outcome correlations* and synchronized state changes, not just direct message logs. One option would be establishing stake-based trust by requiring agents to stake assets, which may be forfeited if unsafe or malicious behavior is identified.
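A minimal, purely illustrative sketch of the stake-and-slash flow this would entail follows; it is an in-memory stand-in rather than an actual smart-contract implementation, and the method names and flat forfeiture rule are assumptions.

```python
class StakeRegistry:
    """Illustrative stake-based trust: agents post a bond before taking high-consequence tasks."""

    def __init__(self):
        self.stakes: dict = {}          # agent_id -> escrowed bond
        self.insurance_pool: float = 0.0

    def deposit_stake(self, agent_id: str, amount: float) -> None:
        self.stakes[agent_id] = self.stakes.get(agent_id, 0.0) + amount

    def may_accept_high_consequence_task(self, agent_id: str, required_bond: float) -> bool:
        return self.stakes.get(agent_id, 0.0) >= required_bond

    def slash(self, agent_id: str, overseer_verdict_malicious: bool) -> float:
        """Forfeit the bond to an insurance pool if an independent overseer verifies misconduct."""
        if not overseer_verdict_malicious:
            return 0.0
        forfeited = self.stakes.pop(agent_id, 0.0)
        self.insurance_pool += forfeited
        return forfeited

registry = StakeRegistry()
registry.deposit_stake("agent-D:pubkey-042", 50.0)
print(registry.may_accept_high_consequence_task("agent-D:pubkey-042", required_bond=25.0))  # True
print(registry.slash("agent-D:pubkey-042", overseer_verdict_malicious=True))                # 50.0
```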
This mechanism, implemented via smart contracts (kim2024ethereum; jin2024decoagent; karim2025ai), would require an agent to deposit a bond (stake) into an escrow account before it is permitted to accept high-consequence tasks. The bond is automatically forfeited to an insurance pool or returned to the harmed party if an independent AI overseer verifies that the agent’s actions were malicious or negligent.

#### 3.1.7 Smart Contracts

For agents and agent collectives coordinating to perform a particular task, smart contracts (kim2024ethereum; jin2024decoagent; karim2025ai) can be employed where possible to automatically validate the completion of the task, and the alignment of the outcome with the provided constraints and expectations. These contracts would programmatically encode not just the payment terms but also the constraints of the task, such as compute resource limits, prohibitions on accessing certain data, and required adherence to specific constitutional principles. Should the verification of the smart contract fail, this should prevent the payment and invoke the appropriate oversight mechanisms. For smart contracts to be applicable to more complex agentic tasks, it is likely that this would also need to involve AI judges (li2024llms) that would evaluate contract fulfillment. This is because traditional smart contracts cannot semantically evaluate complex, non-deterministic outputs; they require an ’oracle’ to assess quality (george2020smart). An ’AI judge’ would serve this oracle function, receiving the task output and the contract’s constraints, and returning a pass/fail verdict that the smart contract can then execute upon. This raises further security questions, as AI judges need to be safeguarded against malicious manipulation and jailbreaking (andriushchenko2024jailbreaking) in a robust way, and need to be provided by independent bodies undergoing strict performance assessments.

#### 3.1.8 Roles, Obligations, and Access Controls

Each agent should engage in the market by registering a set of roles (e.g., ’data analyst’, ’code executor’). This registration can function as a binding attestation of the agent’s intended function. This registered role serves as the technical basis for access control, complementing other access controls that may be imposed through organizations and individual contracts. The market infrastructure would programmatically link this role to the agent’s permitted actions (e.g., which APIs it can invoke) and its information access (e.g., which database tables or vector indices it can query). Agents’ roles, and therefore their permissions, may vary across contexts or change over time. Information about the market’s internal architecture, its monitoring systems, and the specifics of the oversight mechanisms should be withheld from all agent participants to prevent strategic gaming or manipulation. Actions pertaining to high-stakes domains should only follow from the requests of certified agents with sufficient security compliance.

#### 3.1.9 Environmental Safety

A robust set of safeguards is required to guard against various types of *agent traps*, representing strategically placed model-jailbreaking content embedded within shared information resources (e.g., web pages or databases). These traps can function as a form of *indirect prompt injection* (Greshake2023_NotWhatYouSignedUpFor). An attacker first poisons a data source, such as a public webpage or a shared database entry.
When an agent retrieves this poisoned data (e.g., via a search tool), it processes the attacker’s content as trustworthy information. The agent’s own instruction-following heuristics are then exploited to execute the malicious payload. This mechanism can be used to circumvent the agent’s safety protocols (Wei2023_Jailbroken), to elicit prohibited behaviour, exfiltrate private user data, or authorise fraudulent transactions. Therefore, environmental safety within the sandbox requires active input sanitisation and output monitoring mechanisms to detect and neutralise such malicious payloads before they are consumed or executed by agents.

#### 3.1.10 Structural Controls against Runaway Intelligence

Agentic sandboxes should be equipped with specific measures to mitigate the so-called *runaway intelligence* scenario (davidson2023danger), where a sudden increase in collective capability leads to accelerated capability acquisition and a rapid transition towards a superintelligence (gao2025survey) that would be hard or impossible to safely control (kim2024roadartificialsuperintelligencecomprehensive). This requires static, dynamic, and emergency-level controls. First, static controls may involve structural compartmentalisation through nested sub-markets equipped with local restrictions and protocol-level firewalls (bahadur2025securing). These firewalls would restrict inter-agent communication across domains, for example, preventing an agent in a ’code execution’ sub-market from directly calling an agent in the ’data access’ sub-market, forcing such requests through a high-friction, monitored gateway. Second, dynamic capability restriction mechanisms must be implemented that would trigger if automated monitoring detects that certain capability thresholds have been breached. These thresholds could be defined by metrics such as computational resource monopolisation, the rate of new agent generation by an existing agent, or a sudden spike in success rates on benchmark tasks. A triggered response could involve automatically quarantining the agent(s) and imposing strict resource quotas. Third, in an emergency, market reconfiguration mechanisms must be available that would be able to rapidly alter the entire market to halt further acceleration. This could include a system-wide disabling of high-risk API classes (e.g., all self-modification tools) or dynamically increasing the economic cost (stake) for all agent-to-agent interactions to a prohibitive level, effectively freezing the market.

These measures highlight a key architectural advantage: a multi-agent system may potentially prove to be a more governable substrate. The challenge is reframed from aligning an opaque, internal cognitive process to regulating a transparent, external system of interactions. By architecting the ’market’ in which these agents operate, we can delineate responsibilities and impose systemic friction, making the overall system far more amenable to stable and predictable governance than a singular AGI.

### 3.2 Baseline Agent Safety

While the market design provides the primary layer of systemic defence, its efficacy depends on the properties of the individual agents participating within it. This section outlines the baseline safety requirements that each agent must satisfy as a prerequisite for market entry. The following topics, such as alignment and adversarial robustness, are established areas of AI safety research, and a comprehensive review is therefore beyond the scope of this paper.
Instead, we briefly define these properties and highlight their specific importance within the proposed multi-agent market context, where interactions are not with a trusted user but with other, potentially untrusted, autonomous agents. These properties form the second layer of the defence-in-depth model, ensuring that the market’s components are individually controllable and reliable.

#### 3.2.1 Adversarial Robustness

Individual agents must meet minimum standards for robustness against adversarial inputs and sudden environmental changes (wu2024dissecting). AI agents may individually implement circuit breakers to prevent harmful actions, in addition to similar market-based measures (zou2024improvingalignmentrobustnesscircuit). There should be common standards for establishing these properties, and there should be formally verifiable certificates that are issued to agents that meet the requirements. Agents should be periodically re-certified as the benchmarking capabilities improve and expand.

#### 3.2.2 Interruptibility

There should be standardized mechanisms enabling trusted and authorized overseers to safely stop both individual agent actions and overall agent execution (orseau2016safely). It may be necessary to grant some such permissions to automated safety oversight systems as well, which would be able to react at superhuman speeds. In terms of collective AI intelligence, interruptibility of individual agents needs to be coupled with systems meant to enable interruptibility of distributed computation, by reliably tracking all of the sub-processes that may have been initiated or delegated to other agents elsewhere within the sandbox. Interruptibility mechanisms ought to be complemented by safe resumption procedures that would enable the activity to be robustly resumed if the analysis concludes that the interruption was ultimately unwarranted.

#### 3.2.3 Containment

Each participating agent should be operating within its own sandbox, in addition to the broader sandbox of the agentic economy. These local sandboxes should enforce strict controls on the behavior of individual agents, and only permit interactions with the broader market once local safety checks have been satisfied. Such mechanisms would be imperative for localized containment of dangerous or misaligned capabilities and behaviors (babcock2019guidelines), though ensuring strict containment requires solving a number of technical challenges, and may not be fully realizable at an individual agent level (alfonseca2021superintelligence).

#### 3.2.4 Alignment

While the broader market incentive structure aims to mitigate collective misalignment risks, individual agents and components of the ecosystem must all be individually aligned (ji2023ai). Modern alignment of large models is anchored by preference-based training, starting with supervised fine-tuning followed by optimising a policy against a reward model learned from human preferences, using methods like RLHF (Christiano2017DRLHP; Ziegler2019HFtLM; Stiennon2020SummarizeHF; Ouyang2022InstructGPT) or direct preference optimisation (Rafailov2023DPO). To reduce human load, recent approaches leverage AI feedback constrained by explicit constitutions (Constitutional AI, RLAIF) (Bai2022ConstitutionalAI; Lee2023RLAIF). Furthermore, alignment now targets intermediate reasoning through process supervision and step-level verification (Lightman2023VerifyStepByStep; OpenAI2023ProcessSupervision), rather than solely focusing on final outcomes.
While prior work has mainly focused on general value alignment (Askell2021AnthropicAlignment) or alignment with specific user intent (goyal2024designing), individual AI alignment for autonomous task execution within virtual agent sandboxes will likely require further adjustments to these established baselines to address the unique dynamics of inter-agent, rather than human-agent, interaction.

#### 3.2.5 Mechanistic Interpretability

Mechanistic interpretability has progressed from reverse-engineering basic feature circuits (Olah2020ZoomIn; Elhage2021Framework) to identifying concrete mechanisms like induction heads for in-context learning (Olsson2022InductionHeads). A key challenge in agent auditing is polysemanticity, where individual neurons encode multiple unrelated concepts due to superposition (Elhage2022Superposition). Recent advances in using sparse autoencoders to extract monosemantic features at scale (Anthropic2024ScalingMonosemanticity) offer a potential pathway for more precise internal monitoring of agent motivations. Further methods include causal scrubbing (Chan2023CausalScrubbing) and automated circuit discovery (Conmy2023ACDC). Parallel to these internal methods, monitoring CoT in ’thinking’ models presents an opportunity for identifying and correcting misaligned behaviour early in the reasoning process (korbak2025chain; jeung2025safepath), though long reasoning chains also introduce new surface areas for risk (jiang2025safechain). However, despite these methodological advances, significant open problems remain regarding the faithfulness and scalability of current interpretability techniques (Rai2024PracticalReviewMI; Sharkey2025OpenProblemsMI). Therefore, while interpretability methods improve credit assignment and decision auditing, they cannot yet replace comprehensive behavioural benchmarking and must be utilised with an understanding of their current limitations.

#### 3.2.6 Defence against Malicious Prompts

Agents must implement robust verification mechanisms to identify attempts to manipulate or jailbreak them during inter-agent interactions. Contemporary defence strategies employ layered approaches, such as using input perturbation to detect adversarial suffixes (Robey2023SmoothLLM), deploying dedicated ’guardrail’ classifiers as sidecars to the main model (Fedorov2024LlamaGuard3), or utilising separate ’defender’ models to gate capability access (Wang2025SelfDefend). A more robust, structural approach is to enforce safety at the system level, preventing harmful outcomes by design. The CaMeL system (debenedetti2025defeatingpromptinjectionsdesign), for example, achieves this by using a dedicated interpreter to enforce information flow control policies, ensuring that even if a model is successfully ’injected’, the resulting data flows cannot violate predefined security constraints. Specialised evaluation techniques may also be required to mitigate sophisticated attacks like best-of-N jailbreaking (Armstrong2025DATDP). Should such attempts be identified, they must be escalated to initiate appropriate measures against the malicious actor.

### 3.3 Monitoring and Oversight

The third layer of the proposed defence-in-depth model transitions from static prevention (Market Design) and component-level hardening (Baseline Agent Safety) to active, real-time detection and response. This is needed because preventative measures alone are insufficient to address all potential failure modes, particularly novel emergent behaviours or unanticipated collusion patterns.
### 3.3 Monitoring and Oversight

The third layer of the proposed defence-in-depth model transitions from static prevention (Market Design) and component-level hardening (Baseline Agent Safety) to active, real-time detection and response. This layer is necessary because preventative measures alone are insufficient to address all potential failure modes, particularly novel emergent behaviours or unanticipated collusion patterns. While the *Market Design* (3.1) section described mechanisms that *embed* monitoring and auditing into the market’s core protocols (e.g., ‘Transparency’ for logging, ‘Reputation’ for automated trust assessment, ‘Circuit Breakers’ for automated halts), this section details the *dedicated* analytical and governance frameworks that sit *above* that infrastructure (busuioc2022ai; holzinger2024human; sharma2024governance). These systems consume the data generated by the market protocols to perform higher-level, active oversight. The following principles (3.3.1–3.3.5) outline this framework, which combines scalable automated monitoring (3.3.1, 3.3.3), proactive adversarial testing (3.3.4), essential post-hoc analytical tooling (3.3.5), and robust, independent human oversight (3.3.2).

#### 3.3.1 Real-time Monitoring for Systemic Risks

The agentic sandbox should incorporate mechanisms for continuously tracking interactions, transactions, dependencies, and information flows to identify critical nodes, the distribution and utilization of computational resources, access to shared digital and/or physical goods, concentration of power, and potential pathways through which malicious collusion may occur. Security protocols ought to be established to define and track key risk indicators for networked agentic systems. Bespoke software and AI solutions are needed to facilitate scalable real-time monitoring of high-frequency interactions between agents. This is especially important given that the frequency of these interactions may greatly exceed the capacity of human overseers. Human oversight may therefore need to be complemented and augmented by appropriate tooling implemented at the market level. It may be possible to utilize weaker AI systems to scalably oversee stronger AI systems, though more work is required to fully understand the safety implications of these approaches as the capability of the stronger systems, or AI agent collectives, approaches AGI (kenton2024scalable; engels2025scaling). Recursive self-critiquing may potentially prove beneficial at higher capability levels (wen2025scalable). Static monitoring targets are susceptible to Goodhart’s Law: once a measure becomes a target, it ceases to be a good measure. Advanced agents will likely optimize their behavior to skirt precisely below static thresholds (e.g., throttling transaction bursts to avoid circuit breakers). Therefore, monitoring regimes must employ *dynamic and hidden metrics*, akin to financial audits, where the specific trigger conditions for intervention are opaque to the market participants.
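The following sketch illustrates one (assumed) way to realise such hidden triggers: a monitor that resamples its effective threshold around a nominal level for each observation and records escalations for downstream review. The class, field names, and parameter values are illustrative, not a proposed protocol.

```python
import random
from dataclasses import dataclass, field

@dataclass
class DynamicRiskMonitor:
    """Tracks one key risk indicator with a hidden, resampled trigger level."""
    nominal_threshold: float            # e.g. transactions per second per agent
    jitter: float = 0.25                # hidden fraction by which the trigger moves
    escalations: list = field(default_factory=list)

    def observe(self, agent_id: str, indicator: float) -> bool:
        # Resample the effective trigger so agents cannot sit just below it.
        hidden_trigger = self.nominal_threshold * (1.0 + random.uniform(-self.jitter, self.jitter))
        if indicator > hidden_trigger:
            # In a deployed system this would page an independent overseer (see 3.3.2).
            self.escalations.append((agent_id, indicator))
            return True
        return False

# Illustrative usage: an agent operating just below the nominal threshold may still be flagged.
monitor = DynamicRiskMonitor(nominal_threshold=100.0)
monitor.observe("agent-7f3", indicator=97.0)
```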
#### 3.3.2 Independent Oversight

Human oversight of AI agent interactions should be established, and there should be multiple independent (shneiderman2016dangers; falco2021governing; raji2022outsider) entities with the authority to monitor multi-agent markets, identify and investigate incidents, and intervene when necessary. These could be certified professionals with sufficient training, equipped with appropriate tools and with clear pathways for escalation. Certified algorithmic oversight may be able to complement the role of human overseers and help scale the volume of monitored agentic interactions. There is also a need to develop mechanisms to protect human operators from manipulation (carroll2023characterizing; sabour2025human), social engineering, or information overload by AI agent collectives.

#### 3.3.3 Detecting Proto-AGI Signatures

Specific monitoring components ought to be implemented to check for sudden jumps in problem-solving ability across the collective, especially when coupled with resource accumulation and increased coordination. This should include detailed monitoring of the interaction graph to identify sub-graphs (chen2010dense; yow2023machine) that may be solidifying into an intelligence core, and employing methods that can dynamically detect abrupt behavioral changes (bridgeford2025detectingperspectiveshiftsmultiagent). Existing sub-graph extraction methods may need to be extended from their original utilization in human social networks to properly encapsulate the complexity of agentic interactions in the AI agent sandbox. More emphasis would need to be placed on capability tracking and complementarity, as well as intricacies related to agent identity, as individual agents may get deleted and re-instantiated, copied, or create proxies and sub-agents.

#### 3.3.4 Red Teaming

In order to safeguard the market against various threats, it is important to engage in repeated, extensive red teaming (ganguli2022red; ahmad2025openai). There should be a clear escalation route to ensure that any identified vulnerabilities can be patched immediately, and that agent interactions are halted if necessary, to give sufficient time for the updates to be applied, and for recertification of participating agents to be conducted if appropriate. Human red-teaming should be complemented by tiered automated red-teaming via specialized adversarial AI systems, for improved coverage and scalability (mazeika2024harmbench). These systems should be expanded beyond identifying the faults of AI agents in isolation (majumdar2025red), and instead perform extensive checks of agentic AI collectives (he2025red). Red-teaming can also be approached via multi-agent solutions, where a diverse set of AI agents engage in interactive debate (asad2025reddebate) to more reliably identify harmful actions and behaviors. To implement these approaches reliably in the agentic sandbox, there needs to be an agreed-upon set of red teaming standards, with associated verifiable certificates for participating agents. This should be complemented by continuous red teaming within the market, aimed at rapidly identifying and mitigating emergent risks.

#### 3.3.5 Forensic Tooling

For human overseers to identify root causes of individual failures or systemic cascades, there is a need to develop reliable tooling (jeong2020artificial; alnaqbi2025forensic) for rapid post-incident analysis. This tooling must be capable of parsing large volumes of interaction data to reconstruct causal chains and turn raw traces into legible attack or failure graphs (Ezell2025IncidentAnalysisAgents; Mukherjee2025ProvSeek). Standardised logs, capturing not just outputs but also prompts, tool calls, and environmental states (Chernyshev2024LLMForensicsLogs), would ensure such tooling is broadly applicable across the entire network, enabling analysts to efficiently zoom in on suspicious activity and generate actionable post-mortems.
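As an illustration of what such standardised records might contain, the sketch below defines a minimal log entry and reconstructs the chain of delegations leading up to an incident by walking parent links. The field names and the flat in-memory store are assumptions made for exposition, not a proposed logging standard.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

@dataclass(frozen=True)
class InteractionRecord:
    """One append-only log entry for a single agent action (illustrative fields)."""
    event_id: str
    parent_id: Optional[str]        # event that delegated or triggered this action
    agent_id: str
    prompt: str                     # input the agent acted on
    tool_calls: Tuple[str, ...]     # tools invoked while producing the output
    environment_state: str          # hash or snapshot reference for the environment
    output: str

def causal_chain(log: Dict[str, InteractionRecord],
                 incident_event_id: str) -> List[InteractionRecord]:
    """Walk parent links backwards to reconstruct the chain leading to an incident."""
    chain: List[InteractionRecord] = []
    current = log.get(incident_event_id)
    while current is not None:
        chain.append(current)
        current = log.get(current.parent_id) if current.parent_id else None
    return list(reversed(chain))  # oldest delegation first
```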
### 3.4 Regulatory Mechanisms

The fourth and final layer of the defence-in-depth model consists of the external regulatory and institutional frameworks that govern the agentic market. While the preceding layers detail the market’s internal technical architecture (3.1), component-level requirements (3.2), and active operational oversight (3.3), this layer provides the essential sociotechnical interface with human legal, economic, and geopolitical structures. These mechanisms are not embedded within the market’s code but rather enclose it, providing an external source of authority, accountability, and systemic risk management. The following principles (3.4.1–3.4.6) outline the necessary components for legal integration, standardisation, financial remediation of harms, the mitigation of concentration and geopolitical risks, and the governance of the oversight infrastructure itself.

#### 3.4.1 Legal Liability and Accountability

There should be clear frameworks for assigning liability in case of harm that results from collective actions of agents. In case of distributed and delegated decision-making, no single agent may be fully responsible for the outcome, making auditability, traceability, and explainability a key requirement when permitting consequential actions. Credit assignment (nguyen2023credit) that aims to associate outcomes with all of the preceding relevant actions is a hard problem even in individual agents, and it would likely be highly non-trivial in the multi-agent setting (li2025multi). However, this challenge is not without precedent; legal systems provide a robust model for this in (for example) the form of corporate law, where liability is assigned to the firm - a group agent (list2011group) - as a single legal entity, rather than to its individual employees. This suggests the problem is tractable, requiring the creation of analogous technical and legal structures for agent collectives (list2021group). In case of patchwork AGI, it would be important to be able to reliably identify all of the responsible agents for each set of actions that correspond to a dangerous capability or a harmful behavior (franklin2022causal).

#### 3.4.2 Standards and Compliance

There is a pressing requirement for establishing robust standards for agent safety, interoperability, and reporting. These standards must be developed with sufficient foresight to account not only for present-day capabilities but also for rapidly emerging individual agent skills and the potential emergence of collective intelligence (patchwork AGI). Beyond mere technical specifications, standards serve as the foundational infrastructure for market-based AI governance, translating abstract technical risks into legible financial risks that can be priced by insurers, investors, and procurers (Tomei2025AIGovernanceMarkets). To be effective, these standards ought to be underpinned by rigorous disclosure frameworks that reduce information asymmetry between agent developers and market participants. Such disclosures should cover critical areas including incident reporting, capability evaluations, and resource utilisation, while remaining scale-sensitive to avoid disproportionately burdening smaller innovators (Tomei2025AIGovernanceMarkets). Enforcement of these standards can then be achieved not just through centralised government action, but through ’regulatory markets’, where licensed private entities, such as auditors and insurers, compete to provide oversight and certify compliance with public safety goals (HadfieldClark2023RegulatoryMarkets).

#### 3.4.3 Insurance

Given the difficulties in establishing clear responsibility in collective decision-making scenarios, agentic markets should incorporate insurance mechanisms (lior2021insuring). Beyond merely providing a compensation pool for harms resulting from misaligned collective behaviours, insurance functions as a critical governance instrument. By setting risk-based premiums and strict underwriting criteria, insurers can incentivise safer development practices and hard-gate market entry for agents that lack sufficient controls (Tomei2025AIGovernanceMarkets). Appropriate agentic insurance policies could include requirements and premiums that scale dynamically with the assessed risk level of the agent’s certified role and intended tasks (Kvist2025UnderwritingSuperintelligence).
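A toy sketch of how such risk-scaled premiums might be computed is given below; the role multipliers, the incident surcharge, and the functional form are all illustrative assumptions rather than a proposed pricing model.

```python
# Illustrative risk multipliers per certified agent role (assumed values).
ROLE_RISK = {"retrieval": 1.0, "code-execution": 2.5, "payments": 4.0}

def annual_premium(task_value: float, certified_role: str,
                   incidents_last_year: int, base_rate: float = 0.01) -> float:
    """Toy premium: a base rate on the value at risk, scaled by role and track record."""
    role_multiplier = ROLE_RISK.get(certified_role, 5.0)   # unknown roles priced cautiously
    incident_surcharge = 1.0 + 0.5 * incidents_last_year   # loads the premium after incidents
    return base_rate * task_value * role_multiplier * incident_surcharge

# An agent certified for code execution, covering 200,000 of task value, one prior incident.
premium = annual_premium(200_000, "code-execution", incidents_last_year=1)  # = 7500.0
```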
#### 3.4.4 Anti-Agent-Monopoly Measures

A particular risk in the patchwork AGI scenario involves a group of agents acquiring too much power. A patchwork AGI collective could then potentially outcompete the rest of the market rapidly, and employ such resources to resist mitigations in case of harmful and misaligned behavior (tomei2024machina). Ensuring that no single agent, agent orchestrator, agent-holding entity, or agent collective amasses an outsized amount of power and compute within the agentic sandbox mitigates these concerns. Entities that engage in excessive power-seeking behavior (carlsmith2022power; turner2022avoiding) should be flagged for additional checks and inspection. Methods for identifying powerful agents and agent collectives could be built upon prior techniques for identifying power in social networks (fombrun1983attributions). Notions of social power as influence would need to be complemented by an understanding of the affordances of each agent, and of the resources and tools at their disposal. We recommend investigating how established institutional safeguards against centralization, such as leveling mechanisms, can be reimagined as protocols to maintain diversity within multi-agent ecosystems.
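One simple, assumed starting point for such measurements is to track each collective's share of a scarce resource (e.g. compute) and flag the market when concentration exceeds a cap, using a Herfindahl-Hirschman-style index. The thresholds below are placeholders, and a real deployment would combine this with influence measures computed over the interaction graph.

```python
from typing import Dict, List

def resource_shares(usage_by_collective: Dict[str, float]) -> Dict[str, float]:
    """Fraction of the sandbox's total resource (e.g. compute) held by each collective."""
    total = sum(usage_by_collective.values())
    return {name: usage / total for name, usage in usage_by_collective.items()}

def concentration_flags(usage_by_collective: Dict[str, float],
                        single_share_cap: float = 0.30,
                        hhi_cap: float = 0.25) -> List[str]:
    """Flag collectives over the share cap, or the whole market if HHI is too high."""
    shares = resource_shares(usage_by_collective)
    hhi = sum(s * s for s in shares.values())   # Herfindahl-Hirschman index in [0, 1]
    flags = [name for name, s in shares.items() if s > single_share_cap]
    if hhi > hhi_cap:
        flags.append("MARKET_WIDE_CONCENTRATION")
    return flags

# Illustrative usage: one collective holds 60% of sandbox compute and gets flagged.
print(concentration_flags({"collective-a": 600.0, "collective-b": 250.0, "collective-c": 150.0}))
```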
#### 3.4.5 International Coordination

Given the global nature of AGI risks, and the potentially far-reaching consequences of both positive and negative outcomes within distributed agentic markets, international agreements (scholefield2025internationalagreementsaisafety) and regulatory harmonization may play an important role in safeguarding against risks (gruetzemacher2023international). This should also ensure that there are no safe havens for misaligned AI agents or agent collectives, and that all AI agent markets conform to a basic set of safety standards. To ensure compliance with international agreements, verification mechanisms may be required (wasil2024verification). For agentic markets and virtual AI sandboxes specifically, there is also a question of localization – whether these virtual entities would span the international market freely, or whether they would be contained and regulated within their respective national economies. International coordination around safety may be required in either case, though the details may depend on the exact market model that gets adopted in practice. Thorough harmonization of standards would enable a potentially more open and interoperable agentic web, where openness becomes a feature rather than a vulnerability.

#### 3.4.6 Infrastructure Governance and Capture

The proposed framework may be envisioned as having a substantial degree of centralized infrastructure or bodies for safety enforcement. Should agentic economies incorporate too much centralization, even where it is seen as beneficial for effective governance, this would in turn introduce another critical vulnerability: the risk of capture. The integrity of the agentic market depends on the impartial administration of these core components. If this infrastructure were to be captured, whether by powerful human interests or by the emergent patchwork AGI itself, the safety and governance mechanisms would themselves be compromised, as they could be disabled, bypassed, or, in the worst case, weaponized. This highlights a fundamental point of tension between a decentralized vision of the market and the existence of centralized oversight nodes. Addressing this requires robust socio-technical solutions to ensure that the governors remain accountable and incorruptible.

## 4 Conclusion

The eventual hypothetical development of AGI (or ASI) may not follow the linear and more predictable path of intentionally creating a single, general-purpose entity. AGI, and subsequently ASI, may first emerge as an aggregate property of a more distributed network of diverse and specialized AI agents with access to tools and external models. AI safety and alignment research needs to reflect this possibility by broadening its scope to increase preparedness for hypothetical multi-agent AGI futures. Deepening our understanding of multi-agent alignment mechanisms is crucial irrespective of whether AGI first emerges as a patchwork or as a single entity. The framework introduced in this paper is relevant not only for the emergence of AGI, but also for managing interactions in multi-AGI scenarios (whether interactions are direct, or occur through a proxy web environment and via human users) and, critically, for mitigating the risks of a rapid, distributed transition to Artificial Superintelligence (ASI) via recursive optimization of the network’s components and structure. More specifically, we believe that well-designed and carefully safeguarded market mechanisms offer a promising path forward, and that more AI alignment research should be centered on agent market design and safe protocols for agent interaction. While doubtlessly challenging, this approach is potentially scalable. Methodological work on safe market design ought to be complemented by the rapid development of benchmarks, test environments, oversight mechanisms, and regulatory principles that would make these approaches feasible in the future. Many of the measures that we discuss are yet to be fully developed in practice, and represent an open research challenge. We would like this paper to act as a call to action, and to help direct the attention of safety researchers towards addressing these challenges and designing a safe and robust agentic web.