FT: New benchmarks for Gen AI models; Neocloud groups leverage Nvidia chips to borrow >$11B

The Financial Times reports that technology companies are rushing to redesign how they test and evaluate their Gen AI models, as current AI benchmarks appear to be inadequate. AI benchmarks are used to assess how well an AI model can generate content that is coherent, relevant, and creative. This can include generating text, images, music, or any other form of content.

OpenAI, Microsoft, Meta and Anthropic have all recently announced plans to build AI agents that can execute tasks autonomously on behalf of humans. To do this effectively, the AI systems must be able to perform increasingly complex actions, using reasoning and planning.

Current public AI benchmarks such as HellaSwag and MMLU use multiple-choice questions to assess common sense and knowledge across various topics. However, researchers argue this method is now becoming redundant and that models need more complex problems.
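To make the limitation concrete, here is a minimal sketch of how a multiple-choice benchmark can be scored. It is not the actual HellaSwag or MMLU harness: the item format, the toy questions, and the names `multiple_choice_accuracy` and `always_first` are all hypothetical, chosen only for illustration.

```python
from typing import Callable, Sequence

# Hypothetical item format: (question, answer choices, index of the correct choice).
Item = tuple[str, Sequence[str], int]


def multiple_choice_accuracy(
    items: Sequence[Item],
    pick: Callable[[str, Sequence[str]], int],
) -> float:
    """Return the fraction of items where pick() selects the correct choice."""
    if not items:
        raise ValueError("no items to score")
    correct = sum(
        1 for question, choices, answer in items if pick(question, choices) == answer
    )
    return correct / len(items)


# Toy questions invented for this sketch; they are not taken from any real benchmark.
ITEMS: list[Item] = [
    ("2 + 2 = ?", ["3", "4", "5", "6"], 1),
    ("Capital of France?", ["Paris", "Rome", "Oslo", "Bern"], 0),
    ("Water freezes at what temperature in Celsius?", ["100", "50", "0", "-10"], 2),
    ("Largest planet in the solar system?", ["Mars", "Venus", "Earth", "Jupiter"], 3),
]


def always_first(question: str, choices: Sequence[str]) -> int:
    """A stand-in 'model' that ignores the question and picks the first choice."""
    return 0


print(multiple_choice_accuracy(ITEMS, always_first))  # 0.25
```

The score depends only on which option is selected, so a strategy that does no reasoning at all still earns 25% on this toy set, and nothing about planning or multi-step work is measured.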

“We are getting to the era where a lot of the human-written tests are no longer sufficient as a good barometer for how capable the models are,” said Mark Chen, senior vice-president of research at OpenAI. “That creates a new challenge for us as a research world.”

The SWE-bench Verified benchmark was updated in August to better evaluate autonomous systems, based on feedback from companies including OpenAI. It uses real-world software problems sourced from the developer platform GitHub and involves supplying the AI agent with a code repository and an engineering issue, then asking it to fix the issue. The tasks require reasoning to complete.

“It is a lot more challenging [with agentic systems] because you need to connect those systems to lots of extra tools,” said Jared Kaplan, chief science officer at Anthropic.

“You have to basically create a whole sandbox environment for them to play in. It is not as simple as just providing a prompt, seeing what the completion is and then evaluating that.”

Another important factor when conducting more advanced tests is to make sure the benchmark questions are kept out of the public domain, in order to ensure the models do not effectively “cheat” by generating the answers from training data, rather than solving the problem.

The need for new benchmarks has also led to efforts by external organizations. In September, the start-up Scale AI announced a project called “Humanity’s Last Exam”, which crowdsourced complex questions from experts across different disciplines that required abstract reasoning to complete.

Meanwhile, the Financial Times recently reported that Wall Street’s largest financial institutions had loaned more than $11bn to “neocloud” groups, with the loans backed by the groups’ holdings of Nvidia’s AI chips. These companies include names such as CoreWeave, Crusoe and Lambda, and provide cloud computing services to tech businesses building AI products. They have acquired tens of thousands of Nvidia’s graphics processing units (GPUs) through partnerships with the chipmaker. With capital expenditure on data centres surging in the rush to develop AI models, Nvidia’s AI chips have become a precious commodity.

The $3tn tech group’s allocation of chips to neocloud groups has given Wall Street lenders the confidence to lend the companies billions of dollars, which are then used to buy more Nvidia chips. Nvidia is itself an investor in neocloud companies that in turn are among its largest customers. Critics have questioned how well the collateralised chips will hold their value as new advanced versions come to market, or if the current high spending on AI begins to contract. “The lenders all coming in push the story that you can borrow against these chips and add to the frenzy that you need to get in now,” said Nate Koppikar, a short seller at hedge fund Orso Partners. “But chips are a depreciating, not appreciating, asset.”

References:

https://www.ft.com/content/866ad6e9-f8fe-451f-9b00-cb9f638c7c59

https://www.ft.com/content/fb996508-c4df-4fc8-b3c0-2a638bb96c19

https://www.ft.com/content/41bfacb8-4d1e-4f25-bc60-75bf557f1f21
