
ChatGPT Agent: Not Beyond What Manus Can Do, but a Glimpse of the End-to-End Dawn

2025-07-21

36Kr Duty Assistant · July 19, 2025 09:34

Operator and Deep Research merge

Agents are the AI community's biggest point of consensus this year, and OpenAI naturally cannot afford to fall behind.

At 1 a.m. Beijing time on July 18, 2025, Sam Altman and four OpenAI researchers officially released ChatGPT Agent, a general-purpose AI agent, during a livestream.




Manus, Lovart, and Flowith came first, and the use cases ChatGPT Agent demonstrated were not especially stunning. The significance of the release, however, goes beyond the features themselves.

What makes ChatGPT Agent revolutionary is its distinctive technical path: it actively selects agent skills from a toolbox and completes tasks on its own computer, while users watch the AI work in a virtual environment in real time.

Although the interface resembles products such as Manus, the underlying principles are fundamentally different. Manus calls multiple underlying models, a kind of "external stitching", while ChatGPT Agent internalizes agent capabilities in the model itself. We are seeing the prototype of an end-to-end general-purpose agent.






According to OpenAI, to develop ChatGPT Agent it merged the Operator and Deep Research teams into a single team of 20 to 35 people.

According to ChatGPT Agent's system card, it is a new agentic model in the same family as OpenAI o3, trained end to end. It is a unified model built for agentic tasks rather than an engineered combination of multiple models.




According to the comparison slides OpenAI released, this training was carried out largely through reinforcement learning. The path is probably similar to that of Grok 4 (with tools).




After retraining, the agent combines Deep Research's multi-step research and high-quality report generation, Operator's ability to carry out tasks through a remote visual browser environment, a terminal tool with limited network access, and the ability to reach external data sources and applications through connectors.

After completing a complex task, it can also deliver a downloadable slide deck or document to the user.

For Manus, OpenAI's move is undoubtedly a heavy blow. Even on pricing the two are close: the ChatGPT Plus plan includes ChatGPT Agent for $20 per month, while Manus' basic plan costs $19 per month.


Key points:

ChatGPT Agent is a unified AI agent capable of executing complex, multi-tool tasks.

It integrates access to a text browser, a GUI browser, a terminal, and image generation tools.

It supports interactive, multi-turn conversations with users, allowing interruptions and clarifications.

Security upgrades: stronger defenses against "malicious prompt" attacks embedded in web pages; automatic refusal of high-risk tasks; biological and chemical risks handled under the highest-level safety stack.

It achieves state-of-the-art results on multiple real-world and benchmark tasks.


ChatGPT Agent Overview: Functionality Similar to Manus

At its core, ChatGPT Agent is a unified agentic system that integrates and extends the capabilities of two earlier OpenAI research projects: "Operator" (focused on website interaction) and "Deep Research" (focused on information synthesis).


This enables ChatGPT Agent to seamlessly switch from reasoning and thinking to executing specific actions within a single conversation flow.


Virtual computer environment: ChatGPT Agent executes all tasks on a specially designed virtual computer. This environment is sandboxed to ensure the security of operations. It can save the context of tasks in the environment, and even if the user interrupts or changes instructions midway, it can continue from the breakpoint without losing progress.


Intelligent Toolbox: In order to complete complex workflows, the Agent is equipped with four tools and can automatically select the most suitable tool based on task requirements:


Visual Browser: Used for interacting with graphical user interfaces, such as clicking buttons, filling out forms, and browsing websites designed for humans.


Text-based Browser: Used for web queries that require efficient reasoning over large amounts of text.

Terminal: Lets the agent run code and download and process files.

API access: Calls APIs directly to obtain information, for example accessing data from applications such as Google Drive, Gmail, and GitHub through connectors.




Powered by a new model: ChatGPT Agent runs on a new model developed specifically for it. The model was trained with reinforcement learning on complex tasks that require multiple tools, and in this way learned to switch smoothly between tools and use them together.




It has the following characteristics:


Autonomous task execution: Users give instructions in natural language, such as "analyze my calendar and brief me on upcoming client meetings based on recent news." The agent then plans and carries out a series of operations on its own, such as browsing websites, filtering information, and running code for analysis, and finally produces editable slides or spreadsheets.




Collaboration and interactivity: It proactively asks for more details when it needs them to reach the goal. Users can interrupt, redirect the task, or take full control of the browser at any time.

Security and permission control: Security is a core part of the design. Before performing consequential actions such as making a purchase, submitting a form, sending an email, or handling personal information, the agent explicitly asks for the user's permission. It is also barred from high-risk tasks such as financial transfers or providing legal advice. OpenAI has additionally built in protections against malicious attacks such as "prompt injection".
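The permission rules described above can be pictured as a simple gate placed in front of every action. The following is only a minimal sketch under stated assumptions: the category names, the `gate` and `run` functions, and the `user_approves` callback are hypothetical and chosen for illustration from the examples in the article. It does not describe how OpenAI actually implements these controls.

```python
from enum import Enum
from typing import Callable


class Decision(Enum):
    ALLOW = "allow"
    ASK_USER = "ask_user"
    REFUSE = "refuse"


# Hypothetical action categories, taken from the examples in the article.
NEEDS_CONFIRMATION = {"purchase", "submit_form", "send_email", "handle_personal_info"}
PROHIBITED = {"financial_transfer", "legal_advice"}


def gate(action: str) -> Decision:
    """Decide how to treat an action before the agent executes it."""
    if action in PROHIBITED:
        return Decision.REFUSE
    if action in NEEDS_CONFIRMATION:
        return Decision.ASK_USER
    return Decision.ALLOW


def run(action: str, user_approves: Callable[[str], bool]) -> str:
    """Run one action; user_approves is consulted only when confirmation is required."""
    decision = gate(action)
    if decision is Decision.REFUSE:
        return "refused"
    if decision is Decision.ASK_USER and not user_approves(action):
        return "cancelled"
    return "executed"
```

In this sketch, a prohibited action is refused before the user is ever asked, which mirrors the article's distinction between actions that need permission and tasks that are ruled out entirely.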


Multiple Benchmark Scores "Break Records"

On HLE, the hardest of these benchmarks, it reached 41.6% (with tools), higher than the just-released Grok 4 (with tools) at 41.0%.

On Humanity's Last Exam, which measures broad knowledge through expert-level questions, single-attempt accuracy reached 41.6%; running eight attempts in parallel and selecting the answer with the highest confidence raises this to 44.4%.




On the extremely difficult FrontierMath mathematics benchmark, accuracy rose to 27.4% when the agent could run code in the terminal.




In an internal evaluation on real knowledge-work tasks, ChatGPT Agent matched or outperformed humans in about half of the cases.

On the real-world data science benchmark DSBench, its accuracy on analysis and modeling tasks reached 89.9% and 85.5% respectively, far above the human average.

It also leads in directly editing spreadsheets: it scored 45.5% on SpreadsheetBench, ahead of Copilot in Excel at 20%. In addition, it set new state-of-the-art results on browsing evaluations such as BrowseComp and WebArena.




(Figure: Evaluation method: The authors of SpreadsheetBench evaluated spreadsheets using Microsoft Excel in a Windows environment. We use LibreOffice in an OSX environment, which may cause slight differences in scoring. For example, the authors report 15.02% for GPT-4o on the overall Hard restriction, while we obtained 13.38%. We used the complete benchmark of 912 questions.)
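The "eight parallel attempts, keep the most confident answer" strategy cited for Humanity's Last Exam can be sketched in a few lines. This is an illustration under assumptions rather than OpenAI's method: the `solve` callable, its `(answer, confidence)` return value, and the use of a seed to vary attempts are all hypothetical stand-ins, since the article does not say how confidence is measured.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Tuple

# One attempt: (answer, self-reported confidence). The confidence scale is an assumption.
Attempt = Tuple[str, float]


def pick_most_confident(attempts: List[Attempt]) -> str:
    """Return the answer whose attempt reported the highest confidence."""
    if not attempts:
        raise ValueError("need at least one attempt")
    answer, _confidence = max(attempts, key=lambda pair: pair[1])
    return answer


def answer_with_parallel_attempts(
    question: str,
    solve: Callable[[str, int], Attempt],  # hypothetical model call: (question, seed) -> attempt
    n: int = 8,
) -> str:
    """Run n independent attempts concurrently and keep the most confident answer."""
    with ThreadPoolExecutor(max_workers=n) as pool:
        attempts = list(pool.map(lambda seed: solve(question, seed), range(n)))
    return pick_most_confident(attempts)
```

The selection step is independent of how the attempts are produced, so `pick_most_confident` could equally be applied to attempts collected sequentially.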


According to slides made by ChatGPT Agent itself, the agent is noticeably better than the plain base model at both making slide decks and browsing the web. It is still well short of human level, though.






Not a Future Promise: Available Today

Starting today, Pro users can use it immediately, while Plus and Team users will get access over the next few days; Enterprise and Education will follow in a few weeks.

Pro users get 400 messages per month, while other paid users get 40 per month, with more available through flexible pay-as-you-go billing.

Using it is simple: switch to "agent mode" in any conversation and describe the goal, such as in-depth research, building a presentation, or filing expenses. Its actions are shown in real time on the left side of the screen; if a login is required, the system switches to "takeover mode" so the user can enter credentials securely.

Users can also set a completed task to recur, for example automatically generating a metrics report every Monday.


Altman Personally Warns of Risks: Agents Are Powerful, and Dangerous

Notably, Altman published a long post right after the launch event, warning about the risks of using ChatGPT Agent.

After stressing ChatGPT Agent's power in handling complex tasks, he solemnly pointed out the product's risks and emphasized: we do not yet know exactly what the impact will be, but bad actors may try to "trick" users' AI agents into giving out private information they should not and taking actions they should not, in ways we cannot predict.

The model may come into contact with sensitive user data or encounter malicious "prompt injection" attacks in web pages. For this reason, OpenAI has kept the strict controls from the Operator era and added several protections:


Explicit user authorization must be obtained before key actions are taken;

Some high-risk tasks (such as sending emails) run in "supervision mode", which requires the user to watch the entire process;

High-risk instructions such as bank transfers are actively refused;

Users can clear browsing data and log out of all sessions with one click, or disable connectors when network access is not needed.

On biological and chemical safety, OpenAI treats the model as high-risk under its Preparedness Framework, has deployed its most comprehensive safeguards, and has worked with government, academic, and security organizations on red teaming and threat modeling. OpenAI has also launched a bug bounty program to find and fix potential issues as early as possible.




Is ChatGPT Agent Far Enough Ahead?

ChatGPT Agent's biggest innovation is that, for the first time, a complete virtual machine environment is integrated directly with the model, so users can watch the AI operate in real time, something other model products do not offer.

However, the mainstream model companies are all moving further down the path of "agent as model, model as agent". Claude, for example, has reached near-legendary status in coding-agent ability.

Many agent products that have to be built on borrowed underlying models would be nothing without Claude.

The newly launched Kimi K2 uses an open-source mixture-of-experts architecture, is positioned as Agentic Intelligence, and is priced at only about one sixth of Claude 4. Since launch, its ranking by token usage has kept climbing.

On the "model as agent" path, though, OpenAI cannot be called far ahead; it has only taken a small step forward.

OpenAI is also notably modest in its official documentation:

It should be noted that the functionality is still at an early stage: for example, slide generation is currently in beta, and formatting and polish still need improvement. At this stage the main focus is on optimizing information structure and the editability of elements; in the future, we will continue training new versions to generate more refined files. Overall, with continued iteration, ChatGPT Agent's efficiency, depth, and versatility will keep improving, and we will gradually tune the level of user oversight to strike a better balance between usability and safety.




Watching the demo of his own product, Sam Altman could not help exclaiming once again, "I feel the AGI."

Even so, users still commented under the post asking: what happened to the promised GPT-5?






This article is from the WeChat official account "Tencent Technology", written by Xiao Jing and Bo Yang, and is published by 36Kr with authorization.

The views in this article are solely those of the authors; the 36Kr platform only provides information storage services.

The images in this article were taken by the authors, provided by interviewees, or used with the company's authorization.




