I tested the future of AI image generation. It's astoundingly fast.


Yahoo | 23-03-2025

One of the core problems with AI is its notoriously high power and compute demand, especially for tasks such as media generation. On mobile phones, only a handful of pricey devices with powerful silicon can run these features natively, and even cloud-scale deployments are expensive.
Nvidia may have quietly addressed that challenge in partnership with the Massachusetts Institute of Technology and Tsinghua University. The team created a hybrid AI image generation tool called HART (hybrid autoregressive transformer) that essentially combines two of the most widely used AI image creation techniques. The result is a blazing-fast tool with dramatically lower compute requirements.
Just to give you an idea of how fast it is, I asked it to create an image of a parrot playing a bass guitar. It returned the following picture in about a second; I could barely even follow the progress bar. When I gave the same prompt to Google's Imagen 3 model in Gemini, it took roughly 9-10 seconds on a 200 Mbps internet connection.
When AI images first started making waves, the diffusion technique was behind it all, powering products such as OpenAI's Dall-E image generator, Google's Imagen, and Stable Diffusion. This method can produce images with an extremely high level of detail, but it builds each image through dozens of iterative denoising steps, which makes it slow and computationally expensive.
The second approach that has recently gained popularity is autoregressive models, which work in much the same fashion as chatbots: they generate an image by predicting it piece by piece, the way a chatbot predicts the next word. This approach is faster, but also more error-prone.
The team at MIT fused both methods into a single package called HART. It relies on an autoregressive model to predict the compressed image as discrete tokens, while a small diffusion model handles the rest, compensating for the quality lost in compression. The overall approach cuts the number of generation steps from more than two dozen to eight.
The experts behind HART claim that it can 'generate images that match or exceed the quality of state-of-the-art diffusion models, but do so about nine times faster.' HART pairs a roughly 700-million-parameter autoregressive model with a small 37-million-parameter diffusion model.
Interestingly, this hybrid tool was able to create images that matched the quality of top-shelf models with 2 billion parameters. Most importantly, HART achieved that at nine times the image generation speed, while requiring 31% less computation.
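To make the division of labor concrete, here is a minimal, hypothetical sketch of the two-stage pipeline described above, written in PyTorch. The class names, tensor shapes, and sampling logic are illustrative placeholders rather than HART's actual code; they only show how a large autoregressive token predictor and a small residual diffusion model could be chained, with the diffusion stage capped at eight steps.

    import torch
    import torch.nn as nn

    class TinyARModel(nn.Module):
        """Stand-in for HART's ~700M-parameter autoregressive token predictor."""

        def __init__(self, vocab_size=1024, seq_len=256, dim=64):
            super().__init__()
            self.seq_len = seq_len
            self.embed = nn.Embedding(vocab_size, dim)
            self.head = nn.Linear(dim, vocab_size)

        @torch.no_grad()
        def sample_tokens(self, batch_size=1):
            # Predict discrete image tokens one position at a time (greedy decoding here).
            tokens = torch.zeros(batch_size, 1, dtype=torch.long)
            for _ in range(self.seq_len - 1):
                logits = self.head(self.embed(tokens)).mean(dim=1)  # crude context summary
                next_token = logits.argmax(dim=-1, keepdim=True)
                tokens = torch.cat([tokens, next_token], dim=1)
            return tokens

    class TinyResidualDiffusion(nn.Module):
        """Stand-in for the small ~37M-parameter diffusion model that restores lost detail."""

        def __init__(self, channels=4):
            super().__init__()
            self.refine = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

        @torch.no_grad()
        def denoise(self, latent, step):
            # One cheap refinement pass per diffusion step.
            return latent + 0.1 * self.refine(latent)

    def generate(ar_model, diffusion_model, num_diffusion_steps=8):
        # Stage 1: the autoregressive model predicts the compressed image as discrete tokens.
        tokens = ar_model.sample_tokens()
        # "Decode" the tokens into a coarse latent image; quantization discards fine detail.
        latent = tokens.float().reshape(1, 4, 8, 8) / 1024.0
        # Stage 2: the small diffusion model runs only eight steps (instead of 24+)
        # to recover the residual detail the token stage threw away.
        for step in range(num_diffusion_steps):
            latent = diffusion_model.denoise(latent, step)
        return latent  # a full system would decode this latent to pixels

    if __name__ == "__main__":
        out = generate(TinyARModel(), TinyResidualDiffusion())
        print(out.shape)  # torch.Size([1, 4, 8, 8])

In the real system, the autoregressive stage carries the bulk of the parameters (roughly 700 million versus 37 million), which is why limiting the expensive iterative stage to eight steps pays off so dramatically in speed.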
As per the team, the low-compute approach allows HART to run locally on phones and laptops, which is a huge win. So far, the most popular mass-market products such as ChatGPT and Gemini require an internet connection for image generation because the computing happens on cloud servers.
In the test video, the team showcased it running natively on an MSI laptop with an Intel Core series processor and an Nvidia GeForce RTX graphics card. That's a combination found in the majority of gaming laptops out there, and one that doesn't cost a fortune.
HART is capable of producing 1:1 aspect ratio images at a respectable 1024 x 1024 pixels resolution. The level of detail in these images is impressive, and so is the stylistic variation and scenery accuracy. During their tests, the team noted that the hybrid AI tool was anywhere between three and six times faster and offered over seven times higher throughput.
The future potential is exciting, especially when integrating HART's image capabilities with language models. 'In the future, one could interact with a unified vision-language generative model, perhaps by asking it to show the intermediate steps required to assemble a piece of furniture,' says the team at MIT.
They are already exploring that idea, and even plan to test the HART approach on audio and video generation. You can try it out on MIT's web dashboard.
Before we dive into the quality debate, keep in mind that HART is very much a research project in its early stages. On the technical side, the team highlights a few challenges, such as overheads during inference and training.
These challenges can be fixed or, given how minor they are in the bigger scheme of things, simply tolerated. Considering the sheer benefits HART delivers in computing efficiency, speed, and latency, they may well persist without leading to any major performance issues.
In my brief time prompt-testing HART, I was astonished by the pace of image generation. I barely ran into a scenario where the free web tool took more than two seconds to create an image. Even with prompts spanning three paragraphs (roughly 200 words), HART created images that adhered tightly to the description.
Aside from descriptive accuracy, there was plenty of detail in the images. However, HART suffers from the typical failings of AI image generators: it struggles with digits, basic depictions such as people eating food, character consistency, and perspective.
Photorealism involving humans is one area where I noticed glaring failures. On a few occasions, it simply got basic objects wrong, such as confusing a ring with a necklace. But overall, those errors were few and far between, and fundamentally expected; plenty of AI tools that have been around far longer still can't get those details right.
Overall, I am particularly excited by the immense potential of HART. It would be interesting to see whether MIT and Nvidia create a product out of it, or simply adopt the hybrid AI image generation approach in an existing product. Either way, it's a glimpse into a very promising future.


Related Articles

Behind the Curtain: The scariest AI reality
Axios | 10 minutes ago

The wildest, scariest, indisputable truth about AI's large language models is that the companies building them don't know exactly why or how they work. Sit with that for a moment. The most powerful companies, racing to build the most powerful superhuman intelligence capabilities — ones they readily admit occasionally go rogue to make things up, or even threaten their users — don't know why their machines do what they do.

Why it matters: With the companies pouring hundreds of billions of dollars into willing superhuman intelligence into a quick existence, and Washington doing nothing to slow or police them, it seems worth dissecting this Great Unknown.

None of the AI companies dispute this. They marvel at the mystery — and muse about it publicly. They're working feverishly to better understand it. They argue you don't need to fully understand a technology to tame or trust it.

Two years ago, Axios managing editor for tech Scott Rosenberg wrote a story, "AI's scariest mystery," saying it's common knowledge among AI developers that they can't always explain or predict their systems' behavior. And that's more true than ever.

Yet there's no sign that the government or companies or general public will demand any deeper understanding — or scrutiny — of building a technology with capabilities beyond human understanding. They're convinced the race to beat China to the most advanced LLMs warrants the risk of the Great Unknown.

The House, despite knowing so little about AI, tucked language into President Trump's "Big, Beautiful Bill" that would prohibit states and localities from any AI regulations for 10 years. The Senate is considering limitations on the provision. Neither the AI companies nor Congress knows how powerful AI will be a year from now, much less a decade from now.

The big picture: Our purpose with this column isn't to be alarmist or "doomers." It's to clinically explain why the inner workings of superhuman intelligence models are a black box, even to the technology's creators. We'll also show, in their own words, how CEOs and founders of the largest AI companies all agree it's a black box.

Let's start with a basic overview of how LLMs work, to better explain the Great Unknown: LLMs — including OpenAI's ChatGPT, Anthropic's Claude and Google's Gemini — aren't traditional software systems following clear, human-written instructions, like Microsoft Word. In the case of Word, it does precisely what it's engineered to do. Instead, LLMs are massive neural networks — like a brain — that ingest massive amounts of information (much of the internet) to learn to generate answers.

The engineers know what they're setting in motion, and what data sources they draw on. But the LLM's size — the sheer inhuman number of variables in each choice of "best next word" it makes — means even the experts can't explain exactly why it chooses to say anything in particular.

We asked ChatGPT to explain this (and a human at OpenAI confirmed its accuracy): "We can observe what an LLM outputs, but the process by which it decides on a response is largely opaque. As OpenAI's researchers bluntly put it, 'we have not yet developed human-understandable explanations for why the model generates particular outputs.'"

"In fact," ChatGPT continued, "OpenAI admitted that when they tweaked their model architecture in GPT-4, 'more research is needed' to understand why certain versions started hallucinating more than earlier versions — a surprising, unintended behavior even its creators couldn't fully diagnose."
Anthropic — which just released Claude 4, the latest model of its LLM, with great fanfare — admitted it was unsure why Claude, when given access to fictional emails during safety testing, threatened to blackmail an engineer over a supposed extramarital affair. This was part of responsible safety testing — but Anthropic can't fully explain the irresponsible action.

Again, sit with that: The company doesn't know why its machine went rogue and malicious. And, in truth, the creators don't really know how smart or independent the LLMs could grow. Anthropic even said Claude 4 is powerful enough to pose a greater risk of being used to develop nuclear or chemical weapons.

OpenAI's Sam Altman and others toss around the tame word of "interpretability" to describe the challenge. "We certainly have not solved interpretability," Altman told a summit in Geneva last year. What Altman and others mean is they can't interpret the why: Why are LLMs doing what they're doing?

Anthropic CEO Dario Amodei, in an essay in April called "The Urgency of Interpretability," warned: "People outside the field are often surprised and alarmed to learn that we do not understand how our own AI creations work. They are right to be concerned: this lack of understanding is essentially unprecedented in the history of technology." Amodei called this a serious risk to humanity — yet his company keeps boasting of more powerful models nearing superhuman capabilities. Anthropic has been studying the interpretability issue for years, and Amodei has been vocal about warning it's important to solve.

In a statement for this story, Anthropic said: "Understanding how AI works is an urgent issue to solve. It's core to deploying safe AI models and unlocking [AI's] full potential in accelerating scientific discovery and technological development. We have a dedicated research team focused on solving this issue, and they've made significant strides in moving the industry's understanding of the inner workings of AI forward. It's crucial we understand how AI works before it radically transforms our global economy and everyday lives." (Read a paper Anthropic published last year, "Mapping the Mind of a Large Language Model.")

Elon Musk has warned for years that AI presents a civilizational risk. In other words, he literally thinks it could destroy humanity, and has said as much. Yet Musk is pouring billions into his own LLM called Grok. "I think AI is a significant existential threat," Musk said in Riyadh, Saudi Arabia, last fall. There's a 10%-20% chance "that it goes bad."

Reality check: Apple published a paper last week, "The Illusion of Thinking," concluding that even the most advanced AI reasoning models don't really "think," and can fail when stress-tested. The study found that state-of-the-art models (OpenAI's o3-mini, DeepSeek R1 and Anthropic's Claude-3.7-Sonnet) still fail to develop generalizable problem-solving capabilities, with accuracy ultimately collapsing to zero "beyond certain complexities."

But a new report by AI researchers, including former OpenAI employees, called "AI 2027," explains how the Great Unknown could, in theory, turn catastrophic in less than two years. The report is long and often too technical for casual readers to fully grasp. It's wholly speculative, though built on current data about how fast the models are improving. It's being widely read inside the AI companies. It captures the belief — or fear — that LLMs could one day think for themselves and start to act on their own.
Our purpose isn't to alarm or sound doomy. Rather, you should know what the people building these models talk about incessantly. You can dismiss it as hype or hysteria. But researchers at all these companies worry LLMs, because we don't fully understand them, could outsmart their human creators and go rogue.

In the AI 2027 report, the authors warn that competition with China will push LLMs potentially beyond human control, because no one will want to slow progress even if they see signs of acute danger.

The safe-landing theory: Google's Sundar Pichai — and really all of the big AI company CEOs — argue that humans will learn to better understand how these machines work and find clever, if yet unknown ways, to control them and "improve lives." The companies all have big research and safety teams, and a huge incentive to tame the technologies if they want to ever realize their full value.

How to get the most out of Google's free AI Studio
Fast Company | 14 minutes ago

Google's AI Studio and Labs let you experiment for free with new AI tools. I love the way these digital sandboxes — like the one from Hugging Face — let you try out creative new uses of AI. You can dabble around, then download and share what you make, without having to master a complex new platform. Read on for a few Google AI experiments to try. All are free, fast, and easy to use.

1. Transform an image
Upload a photo and use Gemini's AI Studio Image Generation to transform it with prompts. Iterate on your original image until you get a version you like. The model understands natural language, so you don't have to master prompt lingo.

2. Generate an AI voice conversation
AI-generated voices are increasingly hard to distinguish from human ones. If you're surprised, try Generate Speech in the AI Studio or Google's NotebookLM.

How to use Generate Speech in Google's AI Studio:
  • Paste in text, either for a narration or a conversation between two people.
  • Open the settings tab to pick from 30 AI voices. Each is labeled with a characteristic — e.g. upbeat, gravelly, or mature.
  • Click run to generate the conversation. Optionally adjust the playback speed.
  • Download the file if you want to keep it, or paste in different text to try again.

Example: a silly 90-sec chat between two violinists I scripted with Gemini and rendered quickly with this Generate Speech tool.
Use case: Make a narration track for an instructional video. ElevenLabs has a better professional model for this, but AI Studio's is free, easy and quick.

Alternatives:
  • Google's Gemini AI app can also now generate audio overviews from files you upload, if you're on a paid plan.
  • Google's free NotebookLM has a new mobile app, and now lets you generate an audio conversation in any of 50 languages. Unlike Generate Speech in AI Studio, NotebookLM audio overviews summarize your material; they don't perform words as written. Why NotebookLM is so useful.
  • Google's Illuminate lets you generate, listen to, share, and download AI conversations about research papers and famous books. Here's an audio chat about David Copperfield, for example. A bit dry to listen to, but still useful.

3. Make a gif
Alternative: You can also make a static image with Google's Imagen 3 or the new Imagen 4. Write a short prompt and select your preferred aspect ratio. So far I still prefer Ideogram (why I like it) and ChatGPT's new image engine.

4. Generate a short video
Google's Veo 2 and Flow let you generate free short video clips almost instantly with a prompt. Create a clip to add vibrancy or humor to a presentation, or a visual metaphor to help you explain something. Here are 25 other quick ideas for how you might use little AI-generated video scenes.

How to create a video clip with Veo 2:
  • Pick a length (5 to 8 seconds) and select horizontal or vertical orientation.
  • Write a prompt and optionally upload a photo to suggest a visual direction.

Example: Take a look at a parakeet photo I started with and the 5-second video I generated from the photo with Veo 2.
Tip: Convert short video clips into gifs for free with Ezgif or Giphy. Unlike video files, gifs are easy to share and auto-play in an email or presentation.
What's next: Remarkably lifelike clips made with Google's newer Veo 3 model went viral this week. These AI-generated visuals — with sound — are only available on the $250/month(!) plan for now, so try Veo 2 for free.

5. Explain things with lots of tiny cats
This playful mini app creates short, step-by-step visual guides using charming cat illustrations to explain any concept, from how a violin works to the concept behind the matrix.
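If you would rather script these experiments than click through the AI Studio web interface, the same models are reachable through the Gemini API. The snippet below is a minimal sketch, not an official recipe: it assumes the google-genai Python SDK, an API key created in AI Studio, and an Imagen model ID such as "imagen-3.0-generate-002"; model names and availability change often, so check the current AI Studio documentation before relying on it.

    # Minimal sketch: generating a static image via the Gemini API instead of the
    # AI Studio web UI. Assumes the google-genai SDK (pip install google-genai);
    # the model ID below is an assumption and may differ in current docs.
    from google import genai
    from google.genai import types

    client = genai.Client(api_key="YOUR_AI_STUDIO_API_KEY")

    result = client.models.generate_images(
        model="imagen-3.0-generate-002",  # assumed Imagen 3 model ID
        prompt="A parakeet perched on a violin, watercolor style",
        config=types.GenerateImagesConfig(
            number_of_images=1,
            aspect_ratio="1:1",  # mirrors the aspect-ratio picker in AI Studio
        ),
    )

    # Save the first generated image to disk.
    with open("parakeet.png", "wb") as f:
        f.write(result.generated_images[0].image.image_bytes)

Swapping the prompt string is all it takes to iterate, just as it is in the web UI.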

The way you program an AI is like the way you program a person, says Nvidia's Huang
CNBC | 20 minutes ago

Nvidia CEO Jensen Huang says artificial intelligence is the "great equalizer" because it lets anyone program using everyday language. Speaking at London Tech Week on Monday, Huang said that, historically, computing was hard and not available to everyone. "We had to learn programming languages. We had to architect it. We had to design these computers that are very complicated," he said on stage alongside U.K. Prime Minister Keir Starmer. "Now, all of a sudden ... there's a new programming language. This new programming language is called 'human.'"

Conversational AI models were thrown into the spotlight in 2022 when OpenAI's ChatGPT exploded onto the scene. In February, the San Francisco-based tech company said it had 400 million weekly active users. Users can ask chatbots, such as ChatGPT, Google's Gemini or Microsoft's Copilot, questions and they respond in a conversational way that feels more like talking to another human than an AI system.

Huang, whose company engineers some of the world's most advanced semiconductors and AI chips, highlighted that this technology can now be used in programming. He pointed out that very few people know how to use programming languages like C++ or Python, but "everybody ... knows 'human'." "The way you program a computer today, to ask the computer to do something for you, even write a program, generate images, write a poem — just ask it nicely," he said. "And the thing that's really, really quite amazing is the way you program an AI is like the way you program a person."

He gave the example of simply asking a computer to write a poem to describe the keynote speech at the London Tech Week event. "You say: You are an incredible poet ... And I would like you to write a poem to describe today's keynote. And without very much effort, this AI would help you generate such a wonderful poem," he said. "And when it answers ... you could say: I feel like you could do even better. And it would go off and think about it, and it'll come back and say, in fact, I can do better, and it does do a better job."

Huang's comments come as a growing number of companies — such as Shopify, Duolingo and Fiverr — encourage their employees to incorporate AI into their work. Indeed, last week OpenAI announced that it has 3 million paying business users. Huang regularly touts AI's ability to help workers do their jobs more efficiently and has encouraged workers to embrace the technology as they look to make themselves valuable employees — especially given the horror stories around AI's potential to replace jobs.

"This way of interacting with computers, I think, is something that almost anybody can do, and I would just encourage everybody to engage it," Huang added on Monday. "Children are already doing that themselves naturally, and this is going to be transformative."
