OpenAI recently announced ChatGPT Agent, which combines the browsing skills of Operator with the summarisation muscle of Deep Research into a single agent. Instead of pasting chunks of code or formulas into ChatGPT and then copying answers into productivity apps, developers can now prompt one tool to gather data, reason about it, and return an editable spreadsheet or deck in one shot.
The agent writes valid .xlsx and .pptx files by emitting Python code behind the scenes, so the output opens in Excel, LibreOffice, PowerPoint, Keynote, or any library that understands the open formats. Entrepreneur’s early hands-on review notes that even simple one-line prompts generate coherent decks. Behind the chat window, the agent chooses among a GUI browser, a text browser, a POSIX-like terminal, and direct API calls. It can log in to SaaS tools through “connectors,” scrape a table with the text browser, run code in the terminal to shape the data, and drop the result into LibreOffice before handing you the download link.
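To make the file-writing step concrete, here is a minimal sketch of the kind of Python such an agent might emit for the spreadsheet case. It uses openpyxl, one of the open-format libraries the output stays compatible with; the figures, sheet name, and file name are illustrative assumptions, not taken from OpenAI’s implementation.

```python
# Hypothetical example of agent-emitted Python that builds an editable .xlsx.
# The data and layout are made up for illustration.
from openpyxl import Workbook

rows = [
    ("Quarter", "Revenue", "Growth"),
    ("Q1", 1_200_000, 0.08),
    ("Q2", 1_350_000, 0.125),
]

wb = Workbook()
ws = wb.active
ws.title = "Summary"
for row in rows:
    ws.append(row)            # write header plus scraped/derived rows

ws["A4"] = "Total"
ws["B4"] = "=SUM(B2:B3)"      # a native Excel formula keeps the sheet editable
wb.save("summary.xlsx")       # opens in Excel, LibreOffice, Numbers, etc.
```

A deck would follow the same pattern, with a library such as python-pptx building slides instead of rows.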
On SpreadsheetBench, ChatGPT agent reaches 45.5% accuracy versus Copilot-in-Excel’s 20%. OpenAI also claims new state-of-the-art results on DSBench and BrowseComp, plus 41.6% pass@1 on Humanity’s Last Exam, although these benchmarks assume the agent is allowed to run code and browse.
“I would explain this to my own family as … not something I’d yet use for high-stakes uses or with a lot of personal information until we have a chance to study and improve it in the wild.” – Sam Altman
From a developer’s perspective the agent is just another ChatGPT tool choice, so anything you build on the Assistants API inherits it automatically. Connectors let you point the agent at private GitHub repos or Grafana dashboards, while GitHub projects such as Generative-Excel-Data-Assistant and Azure’s “assistant-agent” notebook show how to embed the workflow inside internal apps. Community lists like awesome-ai-agents catalogue dozens of similar open-source projects you can fork today.
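The agent itself ships inside ChatGPT rather than as a standalone endpoint, so the closest API-side analogue today is a tool-equipped assistant. Here is a minimal sketch of that pattern using the OpenAI Python SDK’s Assistants beta; the model name, prompt, and workflow are illustrative assumptions rather than the agent’s own interface.

```python
# Sketch of the tool-equipped assistant pattern (OpenAI Python SDK, Assistants beta).
# Illustrative only: the ChatGPT agent is a product feature, not this exact API.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

assistant = client.beta.assistants.create(
    name="report-builder",
    instructions="Gather the requested figures and return an editable .xlsx summary.",
    tools=[{"type": "code_interpreter"}],  # lets the model run Python and emit files
    model="gpt-4o",                        # assumed model name
)

thread = client.beta.threads.create()
client.beta.threads.messages.create(
    thread_id=thread.id,
    role="user",
    content="Build a one-sheet revenue summary for Q1 and Q2.",
)

run = client.beta.threads.runs.create_and_poll(
    thread_id=thread.id,
    assistant_id=assistant.id,
)
print(run.status)  # generated files appear as attachments on the thread's messages
```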
The feature arrives amid a “bumpy summer” for OpenAI but still lands as a headline-grabbing win. TechRadar’s live test planned a Tokyo itinerary and produced a formatted table. Independent results are more mixed: early ZDNet benchmarks found only one in eight multi-step jobs completed without hallucination, while The Information quoted a tester who waited 30 minutes for a task humans finish in 15. OpenAI itself flags higher risk and slower runtimes when the agent juggles multiple tools. The company recently shared IMO Gold with Gemini and other deep-reasoning models, and Alexander Wei also commented that the company is “releasing GPT-5 soon”.
“The reason why you probably won’t see a complete compression of software where it’s just an agent and a database is because there’s a lot of logic in the workflow and in the company’s specific sort of business process that needs to be built into that and around that database… The agent will make a mistake 1% of the time and it will share the wrong thing with somebody or open up an access privilege to the wrong person.” – Box CEO Aaron Levie
High-quality labelled data is still the oxygen every agentic workflow relies upon. That imperative explains why Meta just wrote a $14 billion check for nearly half of Scale AI, securing human-curated images, code traces and RLHF examples for future Llama releases. Crowdsourcing mainstays like Amazon Mechanical Turk still fill the long tail of edge-case prompts, while startups like Turing, which now taps more than 4 million expert labelers and just tripled revenue to $300 million, pitch themselves as neutral alternatives for labs wary of Meta’s new window into Scale’s pipelines.
Developers looking to build on the agent should treat its outputs as draft material, enforce sandboxed credentials, and capture logs: stepping-stone practices while the ecosystem matures. If you want to learn more, refer to OpenAI’s system card.
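As a closing illustration of those guardrails, here is a hypothetical wrapper that runs each agent or connector step with a scoped credential and writes an audit log. The environment variable and function names are assumptions for the sketch, not part of any OpenAI SDK.

```python
# Hypothetical guardrail wrapper: scoped credentials plus audit logging around agent steps.
import json
import logging
import os
import time

logging.basicConfig(filename="agent_audit.log", level=logging.INFO)

def call_with_audit(tool_name, payload, runner):
    """Run one agent/connector step with a read-only token and log what happened."""
    scoped_token = os.environ["READ_ONLY_CONNECTOR_TOKEN"]  # assumed env var; never an admin key
    started = time.time()
    result = runner(payload, token=scoped_token)            # runner is your own callable
    logging.info(json.dumps({
        "tool": tool_name,
        "payload": payload,
        "elapsed_s": round(time.time() - started, 2),
    }))
    return result  # treat as a draft: a human reviews it before anything ships
```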