
Why we're measuring AI success all wrong—and what leaders should do about it
The way we approach assessment matters. As AI models begin to play a part in everything from hiring decisions to medical diagnoses, our narrow focus on benchmarks and accuracy rates is creating blind spots that could undermine the very outcomes we're trying to achieve. In the long term, it is effectiveness, not efficiency, that matters.
Think about it: When you hire someone for your team, do you only look at their test scores and the speed they work at? Of course not. You consider how they collaborate, whether they share your values, whether they can admit when they don't know something, and how they'll impact your organization's culture—all the things that are critical to strategic success. Yet when it comes to the technology that is increasingly making decisions alongside us, we're still stuck on the digital equivalent of standardized test scores.
The Benchmark Trap
Walk into any tech company today, and you'll hear executives boasting about their latest performance metrics: 'Our model achieved 94.7% accuracy!' or 'We reduced token usage by 20%!' These numbers sound impressive, but they tell us almost nothing about whether these systems will actually serve human needs effectively.
Despite significant tech advances, evaluation frameworks remain stubbornly focused on performance metrics while largely ignoring ethical, social, and human-centric factors. It's like judging a restaurant solely on how fast it serves food while ignoring whether the meals are nutritious, safe, or actually taste good.
This measurement myopia is leading us astray. Many recent studies have found high levels of bias against specific demographic groups when AI models are asked to make decisions about individuals in tasks such as hiring, salary recommendations, loan approvals, and sentencing. These outcomes are not just theoretical. For instance, facial recognition systems deployed in law enforcement contexts continue to show higher error rates when identifying people of color. Yet these systems often pass traditional performance tests with flying colors.
The disconnect is stark: We're celebrating technical achievements while people's lives are being negatively impacted by our measurement blind spots.
Real-World Lessons
IBM's Watson for Oncology was once pitched as a revolutionary breakthrough that would transform cancer care. When measured using traditional metrics, the AI model appeared to be highly impressive, processing vast amounts of medical data rapidly and generating treatment recommendations with clinical sophistication.
However, as Scientific American reported, reality fell far short of this promise. When major cancer centers implemented Watson, significant problems emerged. The system's recommendations often didn't align with best practices, in part because Watson was trained primarily on a limited number of cases from a single institution rather than a comprehensive database of real-world patient outcomes.
The disconnect wasn't in Watson's computational capabilities; by traditional performance metrics, it functioned as designed. The gap showed up in the human-centered questions: Did it improve patient outcomes? Did it augment physician expertise effectively? When measured against these standards, Watson struggled to prove its value, leading many healthcare institutions to abandon the system.
Prioritizing Dignity
Microsoft's Seeing AI is an example of what happens when companies measure success through a human-centered lens from the beginning. As Time magazine reported, the Seeing AI app emerged from Microsoft's commitment to accessibility innovation, using computer vision to narrate the visual world for blind and low-vision users.
What sets Seeing AI apart isn't just its technical capabilities but how the development team prioritized human dignity and independence over pure performance metrics. Microsoft worked closely with the blind community throughout the design and testing phases, measuring success not by accuracy percentages alone, but by how effectively the app enhanced the ability of users to navigate their world independently.
This approach created technology that genuinely empowers users, providing real-time audio descriptions that help with everything from selecting groceries to navigating unfamiliar spaces. The lesson: When we start with human outcomes as our primary success metric, we build systems that don't just work—they make life meaningfully better.
Five Critical Dimensions of Success
Smart leaders are moving beyond traditional metrics to evaluate systems across five critical dimensions:
1. Human-AI Collaboration. Rather than measuring performance in isolation, assess how well humans and technology work together. Recent research in the Journal of the American College of Surgeons showed that AI-generated postoperative reports were only half as likely to contain significant discrepancies as those written by surgeons alone. The key insight: a careful division of labor between humans and machines can improve outcomes while leaving humans free to spend more time on what they do best.
2. Ethical Impact and Fairness. Incorporate bias audits and fairness scores as mandatory evaluation metrics. This means continuously assessing whether systems treat all populations equitably and impact human freedom, autonomy, and dignity positively (see the first sketch after this list).
3. Stability and Self-Awareness. A study in the Nature journal Scientific Reports found performance degradation over time in 91 percent of the models it tested once they were exposed to real-world data. Instead of just measuring a model's out-of-the-box accuracy, track performance over time and assess the model's ability to identify performance dips and escalate to human oversight when its confidence drops (see the second sketch after this list).
4. Value Alignment. As the World Economic Forum's 2024 white paper emphasizes, AI models must operate in accordance with core human values if they are to serve humanity effectively. This requires embedding ethical considerations throughout the technology lifecycle.
5. Long-Term Societal Impact. Move beyond narrow optimization goals to assess alignment with long-term societal benefits. Consider how technology affects authentic human connections, preserves meaningful work, and serves the broader community good. That means:
- Supporting genuine human connection and collaboration
- Preserving meaningful human choice and agency
- Serving human needs rather than reshaping humans to serve technological needs
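To make dimension 2 concrete, here is a minimal bias-audit sketch in Python. It is an illustration under stated assumptions, not a production fairness toolkit: the record format, the 10 percent parity threshold, and the bias_audit function are all hypothetical choices introduced for this example.

```python
# Minimal bias-audit sketch. The data format and the 10% parity
# threshold are illustrative assumptions, not an industry standard.
from collections import defaultdict

def bias_audit(records, parity_threshold=0.10):
    """Compare selection rates across demographic groups.

    `records` is an iterable of (group, prediction) pairs, where
    prediction is 1 if the model selected the person (say, for an
    interview or a loan) and 0 otherwise.
    """
    selected = defaultdict(int)
    total = defaultdict(int)
    for group, prediction in records:
        total[group] += 1
        selected[group] += prediction

    rates = {g: selected[g] / total[g] for g in total}
    # Demographic parity difference: the gap between the most- and
    # least-favored groups' selection rates.
    gap = max(rates.values()) - min(rates.values())
    return rates, gap, gap <= parity_threshold

# Toy data: the model favors group_a two-to-one over group_b.
records = [("group_a", 1), ("group_a", 1), ("group_a", 0),
           ("group_b", 1), ("group_b", 0), ("group_b", 0)]
rates, gap, passed = bias_audit(records)
print(f"selection rates: {rates}, parity gap: {gap:.2f}, passes: {passed}")
```

An audit like this only becomes a mandatory evaluation metric when it runs on every model release and a failing gap blocks deployment, the same way a failing unit test would.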
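Dimension 3 can be sketched the same way: wrap a deployed model so that low-confidence predictions are escalated to a human and live accuracy is tracked against the baseline measured at deployment. The MonitoredModel wrapper, window size, and thresholds below are hypothetical choices for illustration, not a reference implementation.

```python
# Sketch of continuous monitoring with human escalation. The window
# size and both thresholds are illustrative assumptions.
from collections import deque

class MonitoredModel:
    def __init__(self, model, baseline_accuracy, window=500,
                 drift_tolerance=0.05, confidence_floor=0.70):
        self.model = model                  # must expose predict(x) -> (label, confidence)
        self.baseline = baseline_accuracy   # accuracy measured at deployment
        self.recent = deque(maxlen=window)  # rolling record of correct/incorrect
        self.drift_tolerance = drift_tolerance
        self.confidence_floor = confidence_floor

    def predict(self, x):
        label, confidence = self.model.predict(x)
        if confidence < self.confidence_floor:
            # Low confidence: defer to a human instead of guessing.
            return self.escalate(x, label, confidence)
        return label

    def record_outcome(self, was_correct):
        # Called once ground truth arrives; feeds the drift check.
        self.recent.append(was_correct)
        if len(self.recent) == self.recent.maxlen:
            live_accuracy = sum(self.recent) / len(self.recent)
            if live_accuracy < self.baseline - self.drift_tolerance:
                self.alert(live_accuracy)

    def escalate(self, x, label, confidence):
        print(f"escalating to human review (confidence={confidence:.2f})")
        return None  # placeholder: route to a human-in-the-loop queue

    def alert(self, live_accuracy):
        print(f"drift alert: live accuracy {live_accuracy:.0%} "
              f"vs. baseline {self.baseline:.0%}")

# Toy demonstration with a stub model.
class StubModel:
    def predict(self, x):
        return ("approve", 0.62)  # deliberately below the confidence floor

m = MonitoredModel(StubModel(), baseline_accuracy=0.94, window=3)
m.predict("applicant-1")          # 0.62 < 0.70, so this escalates
for correct in (True, False, False):
    m.record_outcome(correct)     # live accuracy of 33% triggers the drift alert
```

The specific thresholds matter less than the design choice they represent: degradation and uncertainty are measured continuously after launch rather than assumed away at release.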
The Path Forward
Forward-thinking leaders implement comprehensive evaluation by starting with the desired human outcomes, establishing continuous human feedback loops, and measuring results against the goals of human stakeholders.
The companies that get this right won't just build better systems—they'll build more trusted, more valuable, and ultimately more successful businesses. They'll create technology that doesn't just process data faster but that genuinely enhances human potential and serves societal needs.
The stakes couldn't be higher. As these AI models become more prevalent in critical decisions around hiring, healthcare, criminal justice, and financial services, our measurement approaches will determine whether these models serve humanity well or perpetuate existing inequalities.
In the end, the most important test of all is whether using AI for a task makes human lives genuinely better. The question isn't whether your technology is fast enough but whether it's human enough. That is the only metric that ultimately matters.