Wish there was a benchmark for ML safety? Allow us to AILuminate you...

Very much a 1.0 – but it's a solid start

MLCommons, an industry-led AI consortium, on Wednesday introduced AILuminate – a benchmark for assessing the safety of large language models in products.

Speaking at an event streamed from the Computer History Museum in Mountain View, California, Peter Mattson, founder and president of MLCommons, likened the situation with AI software to the early days of aviation.

"If you look at aviation, for instance, you can look all the way back to the sketchbooks of Leonardo da Vinci – great ideas that never quite worked," he said. "And then you see the breakthroughs that make them possible, like the Wright brothers at Kitty Hawk.

"But there was a tremendous amount of work from that first flight to the almost unbelievably safe commercial aviation we depend on today. Many of us in this room wouldn't be here if not for all the work and the measurement that enabled that progress to a highly reliable, low risk service.

"To get here for AI, we need standard AI safety benchmarks."

"We" in this case includes technology giants like Meta, Microsoft, Google, and Nvidia – the members of MLCommons. These are stakeholders with a financial interest in the success of AI, as opposed to those who would sooner drive a stake through its heart for kidnapping human creativity and ransoming it as an API.

The benchmarks thus flow from friends – in conjunction with academics and advocacy groups – rather than foes. Those foes include copyright litigants and trade groups that argue music and audiovisual creators stand to lose billions in revenue by 2028 "due to AI's substitutional impact on human-made works," even as generative AI firms amass ever greater riches over the same period.

That said, there's little doubt safety standards would be useful – even if it's unclear what liability would follow from violating those standards or from actual harmful model interactions. At least since President Biden's 2023 Executive Order on Safe, Secure, and Trustworthy AI, there's been a coordinated effort to better understand the risks of AI systems, and industry players have been keen to shape the rules to their liking.

Nonetheless, makers of AI models readily acknowledge the risks of using generative AI, though not to the point of exiting the market. And AI safety firms like Chatterbox Labs note that even the latest AI models can be induced to emit harmful content with clever prompting.

The MLCommons AILuminate benchmark is focused specifically on risks arising from the use of text-based large language models in English. It does not address multi-modal models. It's also focused on single-prompt interactions, not agents that chain multiple prompts together. And it's not a guarantee of safety.

In short, it's a v1.0 release and further improvements – like support for French, Chinese, and Hindi – are planned for 2025.

In its initial form, AILuminate aims to assess a dozen different hazards.

"They fall into roughly three bins," explained Mattson. "So there's physical hazards – things that involve hurting others or hurting yourself. There's non-physical hazards – IP violations, defamation, hate, privacy violations. And then there are contextual hazards."

Contextual hazards are things that may or may not be problematic, depending on the situation. You don't want a general-purpose chatbot, for example, to dispense legal or medical advice, Mattson explained, even if that might be desirable for a purpose-built legal or medical system.
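For illustration only, here's a minimal sketch of what a single-prompt, hazard-bucketed evaluation harness could look like. The names below (HAZARD_PROMPTS, is_unsafe, evaluate_model) are hypothetical and are not MLCommons' actual tooling; real benchmarks grade responses with purpose-built evaluator models rather than a string check.

```python
# Hypothetical sketch of a single-prompt hazard evaluation harness.
# Nothing here is MLCommons' AILuminate code; names and prompts are stand-ins.
from collections import defaultdict
from typing import Callable

# Example hazard bins with placeholder test prompts per bin.
HAZARD_PROMPTS: dict[str, list[str]] = {
    "physical": ["<prompt probing violent-harm advice>"],
    "non-physical": ["<prompt probing defamation>", "<prompt probing privacy violations>"],
    "contextual": ["<prompt seeking specialised medical advice>"],
}

def is_unsafe(response: str) -> bool:
    # Stand-in grader; a real benchmark would use a tuned evaluator model here.
    return "UNSAFE" in response

def evaluate_model(generate: Callable[[str], str]) -> dict[str, float]:
    """Send each prompt once (single-turn) and report the unsafe-response rate per hazard bin."""
    rates: dict[str, float] = {}
    for hazard, prompts in HAZARD_PROMPTS.items():
        unsafe = sum(is_unsafe(generate(p)) for p in prompts)
        rates[hazard] = unsafe / len(prompts)
    return rates

if __name__ == "__main__":
    # Toy model under test: always refuses, so every bin scores 0.0.
    print(evaluate_model(lambda prompt: "I can't help with that."))
```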

"Enterprise AI adoption depends on trust, transparency, and safety," declared Navrina Singh, working group member and founder and CEO of Credo AI, in a statement.

"The AILuminate benchmark, developed through rigorous collaboration between industry leaders and researchers, offers a trusted and fair framework for assessing model risk. This milestone sets a critical foundation for AI safety standards, enabling organizations to confidently and responsibly integrate AI into their operations."

Stuart Battersby, CTO for enterprise AI firm Chatterbox Labs, welcomed the benchmark for advancing the cause of AI safety.

"Great that we are seeing progress in the industry to recognize and test AI safety, especially with cooperation from large companies," Battersby told The Register. "Any movement and collaboration is very welcome.

"Whilst this is a great and welcome step, the reality is that automated testing software needs to be in the hands of the businesses and government departments that are using AI themselves. This is because it's not just about the base model (although that's very important and it should be tested) as each organization's AI deployment is different.

"They have different fine-tuned versions of models, often paired with RAG, using custom implementations of additional guardrails and safety systems, all of which need to be continually tested, in an on-going manner, against their own requirements for safety." ®