Latest from MIT Tech Review – Can we fix AI’s evaluation crisis?

As a tech reporter I often get asked questions like “Is DeepSeek actually better than ChatGPT?” or “Is the Anthropic model any good?” If I don’t feel like turning it into an hour-long seminar, I’ll usually give the diplomatic answer: “They’re both solid in different ways.” Most people asking aren’t defining “good” in any precise…

Latest from MIT Tech Review – A Chinese firm has just launched a constantly changing set of AI benchmarks

When testing an AI model, it’s hard to tell if it is reasoning or just regurgitating answers from its training data. Xbench, a new benchmark developed by the Chinese venture capital firm HSG, or HongShan Capital Group, might help to sidestep that issue. That’s thanks to the way it evaluates models not only on the…

Latest from MIT : LLMs factor in unrelated information when recommending medical treatments

A large language model (LLM) deployed to make treatment recommendations can be tripped up by nonclinical information in patient messages, like typos, extra white space, missing gender markers, or the use of uncertain, dramatic, and informal language, according to a study by MIT researchers. They found that making stylistic or grammatical changes to messages increases…

Latest from MIT : Researchers present bold ideas for AI at MIT Generative AI Impact Consortium kickoff event

Launched in February of this year, the MIT Generative AI Impact Consortium (MGAIC), a presidential initiative led by MIT’s Office of Innovation and Strategy and administered by the MIT Stephen A. Schwarzman College of Computing, issued a call for proposals, inviting researchers from across MIT to submit ideas for innovative projects studying high-impact uses of…

O’Reilly Media – Designing Collaborative Multi-Agent Systems with the A2A Protocol

It feels like every other AI announcement lately mentions “agents.” And already, the AI community has 2025 pegged as “the year of AI agents,” sometimes without much more detail than “They’ll be amazing!” Often forgotten in this hype are the fundamentals. Everybody is dreaming of armies of agents, booking hotels and flights, researching complex topics,…

Latest from MIT Tech Review – It’s pretty easy to get DeepSeek to talk dirty

AI companions like Replika are designed to engage in intimate exchanges, but people use general-purpose chatbots for sex talk too, despite their stricter content moderation policies. Now new research shows that not all chatbots are equally willing to talk dirty: DeepSeek is the easiest to convince. But other AI chatbots can be enticed too, if…

Latest from MIT Tech Review – OpenAI can rehabilitate AI models that develop a “bad boy persona”

A new paper from OpenAI released today has shown why a little bit of bad training can make AI models go rogue but also demonstrates that this problem is generally pretty easy to fix.  Back in February, a group of researchers discovered that fine-tuning an AI model (in their case, OpenAI’s GPT-4o) by training it…