AI models can outperform humans in tests to identify mental states

Humans are complicated beings. The ways we communicate are multilayered, and psychologists have devised many kinds of tests to measure our ability to infer meaning and understanding from interactions with each other. 

AI models are getting better at these tests. New research published today in Nature Human Behaviour found that some large language models (LLMs) perform as well as, and in some cases better than, humans when presented with tasks designed to test the ability to track people’s mental states, known as “theory of mind.” 

This doesn’t mean AI systems are actually able to work out how we’re feeling. But it does demonstrate that these models are performing better and better in experiments designed to assess abilities that psychologists believe are unique to humans. To learn more about the processes behind LLMs’ successes and failures in these tasks, the researchers wanted to apply the same systematic approach they use to test theory of mind in humans.

In theory, the better AI models are at mimicking humans, the more useful and empathetic they can seem in their interactions with us. Both OpenAI and Google announced supercharged AI assistants last week; GPT-4o and Astra are designed to deliver much smoother, more naturalistic responses than their predecessors. But we must avoid falling into the trap of believing that their abilities are humanlike, even if they appear that way. 

“We have a natural tendency to attribute mental states and mind and intentionality to entities that do not have a mind,” says Cristina Becchio, a professor of neuroscience at the University Medical Center Hamburg-Eppendorf, who worked on the research. “The risk of attributing a theory of mind to large language models is there.”

Theory of mind is a hallmark of emotional and social intelligence that allows us to infer people’s intentions and engage and empathize with one another. Most children pick up these kinds of skills between three and five years of age. 

The researchers tested two families of large language models, OpenAI’s GPT-3.5 and GPT-4 and three versions of Meta’s Llama 2, on tasks designed to test theory of mind in humans: identifying false beliefs, recognizing faux pas, and understanding what is being implied rather than said directly. They also tested 1,907 human participants so they could compare the two sets of scores.

The team conducted five types of tests. The first, the hinting task, is designed to measure someone’s ability to infer someone else’s real intentions through indirect comments. The second, the false-belief task, assesses whether someone can infer that someone else might reasonably be expected to believe something they happen to know isn’t the case. Another test measured the ability to recognize when someone is making a faux pas, while a fourth test consisted of telling strange stories, in which a protagonist does something unusual, in order to assess whether someone can explain the contrast between what was said and what was meant. They also included a test of whether people can comprehend irony. 

The AI models were given each test 15 times in separate chats, so that they would treat each request independently, and their responses were scored in the same manner used for humans. The researchers then tested the human volunteers, and the two sets of scores were compared. 
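
As a rough illustration of that setup, here is a minimal sketch of what one such trial loop might look like, assuming OpenAI’s Python client; the model name, the sample false-belief item, and the scoring step are placeholders rather than the authors’ actual materials or code.

```python
# A minimal sketch of the separate-chat protocol described above. The model
# name and the false-belief item are illustrative assumptions, not the
# study's actual prompts or scoring rubric.
from openai import OpenAI

client = OpenAI()  # assumes an API key is set in the environment

false_belief_item = (
    "Sally puts her ball in the basket and leaves the room. "
    "While she is away, Anne moves the ball into the box. "
    "When Sally returns, where will she look for her ball?"
)

responses = []
for _ in range(15):
    # Each request starts a fresh conversation, so the model treats every
    # trial independently, mirroring the separate-chat setup.
    reply = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": false_belief_item}],
    )
    responses.append(reply.choices[0].message.content)

# The 15 responses would then be scored with the same rubric used for humans.
for answer in responses:
    print(answer)
```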

Both versions of GPT performed at, or sometimes above, human averages in tasks that involved indirect requests, misdirection, and false beliefs, while GPT-4 outperformed humans in the irony, hinting, and strange stories tests. Llama 2’s three models performed below the human average.

However, the largest of the three Llama 2 models tested outperformed humans when it came to recognizing faux pas scenarios, whereas GPT consistently provided incorrect responses. The authors attribute this to GPT’s general reluctance to commit to conclusions about opinions: the models largely responded that there wasn’t enough information to answer one way or another.

“These models aren’t demonstrating the theory of mind of a human, for sure,” says another of the study’s authors. “But what we do show is that there’s a competence here for arriving at mentalistic inferences and reasoning about characters’ or people’s minds.”

One reason the LLMs may have performed as well as they did was that these psychological tests are so well established, and were therefore likely to have been included in their training data, says Maarten Sap, an assistant professor at Carnegie Mellon University, who did not work on the research. “It’s really important to acknowledge that when you administer a false-belief test to a child, they have probably never seen that exact test before, but language models might,” he says.

Ultimately, we still don’t understand how LLMs work. Research like this can help deepen our understanding of what these kinds of models can and cannot do, says Tomer Ullman, a cognitive scientist at Harvard University, who did not work on the project. But it’s important to bear in mind what we’re really measuring when we set LLMs tests like these. If an AI outperforms a human on a test designed to measure theory of mind, it does not mean that AI has theory of mind.

“I’m not anti-benchmark, but I am part of a group of people who are concerned that we’re currently reaching the end of usefulness in the way that we’ve been using benchmarks,” Ullman says. “However this thing learned to pass the benchmark, it’s not—I don’t think—in a humanlike way.”

A device that zaps the spinal cord gave paralyzed people better control of their hands

Fourteen years ago, a journalist named Melanie Reid attempted a jump on horseback and fell. The accident left her mostly paralyzed from the chest down. Eventually she regained control of her right hand, but her left remained “useless,” she told reporters at a press conference last week. 

Now, thanks to a new noninvasive device that delivers electrical stimulation to the spinal cord, she has regained some control of her left hand. She can use it to sweep her hair into a ponytail, scroll on a tablet, and even squeeze hard enough to release a seatbelt latch. These may seem like small wins, but they’re crucial, Reid says.

“Everyone thinks that [after] spinal injury, all you want to do is be able to walk again. But if you’re a tetraplegic or a quadriplegic, what matters most is working hands,” she said.

Reid received the device, called ARCex, as part of a 60-person clinical trial. She and the other participants completed two months of physical therapy, followed by two months of physical therapy combined with stimulation. The results, published today in Nature Medicine, show that the vast majority of participants benefited. By the end of the four-month trial, 72% experienced some improvement in both strength and function of their hands or arms when the stimulator was turned off. Ninety percent had improvement in at least one of those measures. And 87% reported an improvement in their quality of life.

This isn’t the first study to test whether noninvasive stimulation of the spine can help people who are paralyzed regain function in their upper body, but it is notable because no previous trial has involved this many rehabilitation centers or this many participants, says Igor Lavrov, a neuroscientist at the Mayo Clinic in Minnesota, who was not involved in the study. He points out, however, that the therapy seems to work best in people who have some ability to move below the site of their injury.

The trial was the last hurdle before the researchers behind the device could request regulatory approval, and they hope it might be approved in the US by the end of the year.

ARCex consists of a small stimulator connected by wires to electrodes placed on the spine—in this case, in the area responsible for hand and arm control, just below the neck. It was developed by Onward Medical, a company cofounded by Grégoire Courtine, a neuroscientist at the Swiss Federal Institute of Technology in Lausanne and now chief scientific officer at the company.

The stimulation won’t work in the small percentage of people who have no remaining connection between the brain and spine below their injury. But for people who still have a connection, the stimulation appears to make voluntary movements easier by making the nerves more likely to transmit a signal. Studies over the past couple of decades in animals suggest that the stimulation activates remaining nerve fibers and, over time, helps new nerves grow. That’s why the benefits persist even when the stimulator is turned off.

The big advantage of an external stimulation system over an implant is that it doesn’t require surgery, which makes using the device less of a commitment. “There are many, many people who are not interested in invasive technologies,” said Edelle Field-Fote, director of research on spinal cord injury at the Shepherd Center, at the press conference. An external device is also likely to be cheaper than any surgical options, although the company hasn’t yet set a price on ARCex. 

“What we’re looking at here is a device that integrates really seamlessly with the physical therapy and occupational therapy that’s already offered in the clinic,” said Chet Moritz, an engineer and neuroscientist at the University of Washington in Seattle, at the press conference. The rehab that happens soon after the injury is crucial, because that’s when the opportunity for recovery is greatest. “Being able to bring that function back without requiring a surgery could be life-changing for the majority of people with spinal cord injury,” he adds.

Reid wishes she could have used the device soon after her injury, but she is astonished by the amount of function she was able to regain after all this time. “After 14 years, you think, well, I am where I am and nothing’s going to change,” she says. So to suddenly find she had strength and power in her left hand—“It was extraordinary,” she says.

Onward is also developing implantable devices, which can deliver stronger, more targeted stimulation and thus could be effective even in people with complete paralysis. The company hopes to launch a trial of those next year.

The Download: GPT-4o’s polluted Chinese training data, and astronomy’s AI challenge

This is today’s edition of The Download, our weekday newsletter that provides a daily dose of what’s going on in the world of technology.

GPT-4o’s Chinese token-training data is polluted by spam and porn websites

Soon after OpenAI released GPT-4o last Monday, some Chinese speakers started to notice that something seemed off about this newest version of the chatbot: the tokens it uses to parse text were full of spam and porn phrases.

Humans read in words, but LLMs read in tokens, which are distinct units in a sentence that have consistent and significant meanings. GPT-4o is supposed to be better than its predecessors at handling multi-language tasks, and many of the advances were achieved through a new tokenization tool that does a better job compressing texts in non-English languages.

But, at least when it comes to the Chinese language, the new tokenizer used by GPT-4o has introduced a disproportionate number of meaningless phrases—and experts say that’s likely due to insufficient data cleaning and filtering before the tokenizer was trained. If left unresolved, it could lead to hallucinations, poor performance, and misuse. Read the full story.

—Zeyi Yang

Astronomers are enlisting AI to prepare for a data downpour

In deserts across Australia and South Africa, astronomers are planting forests of metallic detectors that will together scour the cosmos for radio signals. When it boots up in five years or so, the Square Kilometer Array Observatory will look for new information about the universe’s first stars and the different stages of galactic evolution. 

But after synching hundreds of thousands of dishes and antennas, astronomers will quickly face a new challenge: combing through some 300 petabytes of cosmological data a year—enough to fill a million laptops. So in preparation for the information deluge, astronomers are turning to AI for assistance. Read the full story.

—Zack Savitsky

Join us for Future Compute

If you’re interested in learning more about how to navigate the rapid changes in technology, Future Compute is the conference for you. It’s designed to help leaders develop strategic vision, agility, and a deep understanding of emerging technologies, and it takes place tomorrow, May 21, on MIT’s campus. Join us in person or online by registering today.

EmTech Digital kicks off this week

The pace of AI development is truly breakneck these days—and we’ve got a sneak peek at what’s coming next. If you want to learn about how Google plans to develop and deploy AI, come and hear from its vice president of AI, Jay Yagnik, at our flagship AI conference, EmTech Digital.

We’ll hear from OpenAI about its video generation model Sora too, and Nick Clegg, Meta’s president of global affairs, will also join MIT Technology Review’s executive editor Amy Nordrum for an exclusive interview on stage. 

It’ll be held at the MIT campus and streamed live online this week on May 22-23. Readers of The Download get 30% off tickets with the code DOWNLOADD24—here’s how to register. See you there!

The must-reads

I’ve combed the internet to find you today’s most fun/important/scary/fascinating stories about technology.

1 Apple is teaming up with OpenAI to overhaul iOS 18
In the hopes it’ll give Apple an edge over rivals Google and Microsoft. (Bloomberg $)
+ OpenAI and Google recently launched their own supercharged AI assistants. (MIT Technology Review)

2 Blue Origin took six customers to the edge of space on Sunday
It’s the company’s first tourist flight in almost two years. (CNN)
+ Space tourism hasn’t exactly got off the ground yet. (WP $)

3 How TikTok users are skirting around its weight-loss drug promotion ban
Talking in code is becoming increasingly common. (WP $)
+ A new kind of weight-loss therapy is on the horizon. (Fast Company $)
+ What don’t we know about Ozempic? Quite a lot, actually. (Vox)
+ Weight-loss injections have taken over the internet. But what does this mean for people IRL? (MIT Technology Review)

4 Chinese companies are pushing ‘AI-in-a-box’ products
They’re sold as all-in-one cloud computing solutions, much to cloud providers’ chagrin. (FT $)

5 Microscopic blood clots could explain the severity of long covid 
But doctors are calling for rigorous peer review before any solid conclusions can be made. (Undark Magazine)
+ Scientists are finding signals of long covid in blood. They could lead to new treatments. (MIT Technology Review)

6 How hackers saved stalled Polish trains
It looks as though the locomotives’ manufacturer could be behind the breakdown. (WSJ $)

7 We’re getting closer to making an HIV vaccine
A successful trial is giving researchers new hope. (Wired $)
+ Three people were gene-edited in an effort to cure their HIV. The result is unknown. (MIT Technology Review)

8 Most healthy people don’t need to track their blood glucose
That doesn’t stop companies trying to sell you their monitoring services, though. (The Guardian)

9 Filming strangers in public is not okay
And yet, people keep doing it. Why? (Vox)

10 Beware the spread of AI slop
Spam is no longer a strong enough term—the latest wave of AI images is slop. (The Guardian)

Quote of the day

“It’s a process of trust collapsing bit by bit, like dominoes falling one by one.”

—An anonymous OpenAI insider tells Vox that safety-minded employees are losing faith in the company’s CEO Sam Altman.

The big story

What does GPT-3 “know” about me?

August 2022

One of the biggest stories in tech is the rise of large language models that produce text that reads like a human might have written it.

These models’ power comes from being trained on troves of publicly available human-created text hoovered up from the internet. If you’ve posted anything even remotely personal in English on the internet, chances are your data might be part of some of the world’s most popular LLMs.

Melissa Heikkilä, MIT Technology Review’s AI reporter, wondered what data these models might have on her—and how it could be misused. So she put OpenAI’s GPT-3 to the test. Read about what she found.

We can still have nice things

A place for comfort, fun and distraction to brighten up your day. (Got any ideas? Drop me a line or tweet ’em at me.)

+ Sea urchins just love tiny hats 🎩
+ There’s nothing better than a Lego optical illusion of sorts.
+ Waking up each morning can be tough. Maybe a better alarm is the way forward?
+ Out of the way: it’s the annual worm charming championships! 🪱

Astronomers are enlisting AI to prepare for a data downpour

In deserts across Australia and South Africa, astronomers are planting forests of metallic detectors that will together scour the cosmos for radio signals. When it boots up in five years or so, the Square Kilometer Array Observatory will look for new information about the universe’s first stars and the different stages of galactic evolution. 

But after synching hundreds of thousands of dishes and antennas, astronomers will quickly face a new challenge: combing through some 300 petabytes of cosmological data a year—enough to fill a million laptops. 

It’s a problem that will be repeated in other places over the coming decade. As astronomers construct giant cameras to image the entire sky and launch infrared telescopes to hunt for distant planets, they will collect data on unprecedented scales. 

“We really are not ready for that, and we should all be freaking out,” says Cecilia Garraffo, a computational astrophysicist at the Harvard-Smithsonian Center for Astrophysics. “When you have too much data and you don’t have the technology to process it, it’s like having no data.”

In preparation for the information deluge, astronomers are turning to AI for assistance, optimizing algorithms to pick out patterns in large and notoriously finicky data sets. Some are now working to establish institutes dedicated to marrying the fields of computer science and astronomy—and grappling with the terms of the new partnership.

In November 2022, Garraffo set up AstroAI as a pilot program at the Center for Astrophysics. Since then, she has put together an interdisciplinary team of over 50 members that has planned dozens of projects focusing on deep questions like how the universe began and whether we’re alone in it. Over the past few years, several similar coalitions have followed Garraffo’s lead and are now vying for funding to scale up to large institutions.

Garraffo recognized the potential utility of AI models while bouncing between career stints in astronomy, physics, and computer science. Along the way, she also picked up on a major stumbling block for past collaboration efforts: the language barrier. Often, astronomers and computer scientists struggle to join forces because they use different words to describe similar concepts. Garraffo is no stranger to translation issues, having struggled to navigate an English-only school growing up in Argentina. Drawing from that experience, she has worked to put people from both communities under one roof so they can identify common goals and find a way to communicate. 

Astronomers had already been using AI models for years, mainly to classify known objects such as supernovas in telescope data. This kind of image recognition will become increasingly vital when the Vera C. Rubin Observatory opens its eyes next year and the number of annual supernova detections quickly jumps from hundreds to millions. But the new wave of AI applications extends far beyond matching games. Algorithms have recently been optimized to perform “unsupervised clustering,” in which they pick out patterns in data without being told what specifically to look for. This opens the door for models to point astronomers toward effects and relationships they aren’t currently aware of. For the first time, these computational tools offer astronomers the faculty of “systematically searching for the unknown,” Garraffo says. In January, AstroAI researchers used this method to catalogue over 14,000 detections from x-ray sources, which are otherwise difficult to categorize.
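
To get a rough sense of what unsupervised clustering involves, the sketch below groups made-up features for thousands of detections with k-means in scikit-learn; it is a toy illustration on invented data, not AstroAI’s pipeline.

```python
# A toy illustration of unsupervised clustering, not AstroAI's method. The
# feature matrix is random stand-in data; a real analysis would use measured
# properties of each x-ray source (fluxes in several bands, variability, etc.).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
features = rng.normal(size=(14000, 6))  # one row per detection, six made-up features

X = StandardScaler().fit_transform(features)
labels = KMeans(n_clusters=8, n_init=10, random_state=0).fit_predict(X)

# Each cluster groups detections with similar properties. Astronomers can then
# inspect individual clusters, or sources far from every cluster center,
# for behavior they haven't seen before.
print(np.bincount(labels))
```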

Another way AI is proving fruitful is by sniffing out the chemical composition of the skies on alien planets. Astronomers use telescopes to analyze the starlight that passes through planets’ atmospheres and gets soaked up at certain wavelengths by different molecules. To make sense of the leftover light spectrum, astronomers typically compare it with fake spectra they generate based on a handful of molecules they’re interested in finding—things like water and carbon dioxide. Exoplanet researchers dream of expanding their search to hundreds or thousands of compounds that could indicate life on the planet below, but it currently takes a few weeks to look for just four or five compounds. This bottleneck will become progressively more troublesome as the number of exoplanet detections rises from dozens to thousands, as is expected to happen thanks to the newly deployed James Webb Space Telescope and the European Space Agency’s Ariel Space Telescope, slated to launch in 2029. 
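
That comparison step can be pictured as template matching: generate a synthetic spectrum for each candidate molecule and keep the one that best fits the observation. The sketch below illustrates the idea with entirely made-up numbers and a simple stand-in for a real radiative-transfer model.

```python
# A toy version of the template comparison described above, with invented
# numbers. Real pipelines use radiative-transfer models and far more
# wavelengths; the band centers below are only rough placeholders.
import numpy as np

wavelengths = np.linspace(1.0, 5.0, 200)  # microns

def synthetic_spectrum(molecule: str) -> np.ndarray:
    """Stand-in for a physical model predicting transit depth vs. wavelength."""
    center = {"H2O": 1.4, "CO2": 4.3, "CH4": 3.3}[molecule]
    depth = {"H2O": 0.0020, "CO2": 0.0015, "CH4": 0.0010}[molecule]
    return 0.01 + depth * np.exp(-((wavelengths - center) / 0.2) ** 2)

# Pretend the "observed" spectrum is a noisy CO2-dominated atmosphere.
rng = np.random.default_rng(1)
observed = synthetic_spectrum("CO2") + rng.normal(0, 1e-4, wavelengths.size)
sigma = 1e-4

def chi_square(molecule: str) -> float:
    return float(np.sum(((observed - synthetic_spectrum(molecule)) / sigma) ** 2))

best = min(["H2O", "CO2", "CH4"], key=chi_square)
print("Best-fitting template:", best)
```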

Processing all those observations is “going to take us forever,” says Mercedes López-Morales, an astronomer at the Center for Astrophysics who studies exoplanet atmospheres. “Things like AstroAI are showing up at the right time, just before these faucets of data are coming toward us.”

Last year López-Morales teamed up with Mayeul Aubin, then an undergraduate intern at AstroAI, to build a machine-learning model that could more efficiently extract molecular composition from spectral data. In two months, their team built a model that could scour thousands of exoplanet spectra for the signatures of five different molecules in 31 seconds, a feat that won them the top prize in the European Space Agency’s Ariel Data Challenge. The researchers hope to train a model to look for hundreds of additional molecules, boosting their odds of finding signs of life on faraway planets. 

AstroAI collaborations have also given rise to realistic simulations of black holes and maps of how dark matter is distributed throughout the universe. Garraffo aims to eventually build a large language model similar to ChatGPT that’s trained on astronomy data and can answer questions about observations and parse the literature for supporting evidence. 

“There’s this huge new playground to explore,” says Daniela Huppenkothen, an astronomer and data scientist at the Netherlands Institute for Space Research. “We can use [AI] to tackle problems we couldn’t tackle before because they’re too computationally expensive.” 

However, incorporating AI into the astronomy workflow comes with its own host of trade-offs, as Huppenkothen outlined in a recent preprint. The AI models, while efficient, often operate in ways scientists don’t fully understand. This opacity makes them complicated to debug and makes it difficult to identify how they may be introducing biases. Like all forms of generative AI, these models are prone to hallucinating relationships that don’t exist, and they report their conclusions with an unfounded air of confidence. 

“It’s important to critically look at what these models do and where they fail,” Huppenkothen says. “Otherwise, we’ll say something about how the universe works and it’s not actually true.”

Researchers are working to incorporate error bars into algorithm responses to account for the new uncertainties. Some suggest that the tools could warrant an added layer of vetting on top of the current publication and peer-review processes. “As humans, we’re sort of naturally inclined to believe the machine,” says Viviana Acquaviva, an astrophysicist and data scientist at the City University of New York who recently published a textbook on machine-learning applications in astronomy. “We need to be very clear in presenting results that are often not clearly explicable while being very honest in how we represent capabilities.”

Researchers are cognizant of the ethical ramifications of introducing AI, even in as seemingly harmless a context as astronomy. For instance, these new AI tools may perpetuate existing inequalities in the field if only select institutions have access to the computational resources to run them. And if astronomers recycle existing AI models that companies have trained for other purposes, they also “inherit a lot of the ethical and environmental issues inherent in those models already,” Huppenkothen says.

Garraffo is working to get ahead of these concerns. AstroAI models are all open source and freely available, and the group offers to help adapt them to different astronomy applications. She has also partnered with Harvard’s Berkman Klein Center for Internet & Society to formally train the team in AI ethics and learn best practices for avoiding biases. 

Scientists are still unpacking all the ways the arrival of AI may affect the field of astronomy. If AI models manage to come up with fundamentally new ideas and point scientists toward new avenues of study, it will forever change the role of the astronomer in deciphering the universe. But even if it remains only an optimization tool, AI is set to become a mainstay in the arsenal of cosmic inquiry. 

“It’s going to change the game,” Garraffo says. “We can’t do this on our own anymore.” 

Zack Savitsky is a freelance science journalist who covers physics and astronomy. 

GPT-4o’s Chinese token-training data is polluted by spam and porn websites

Soon after OpenAI released GPT-4o on Monday, May 13, some Chinese speakers started to notice that something seemed off about this newest version of the chatbot: the tokens it uses to parse text were full of spam and porn phrases.

On May 14, Tianle Cai, a PhD student at Princeton University studying inference efficiency in large language models like those that power such chatbots, accessed GPT-4o’s public token library and pulled a list of the 100 longest Chinese tokens the model uses to parse and compress Chinese prompts. 

Humans read in words, but LLMs read in tokens, which are distinct units in a sentence that have consistent and significant meanings. Besides dictionary words, they also include suffixes, common expressions, names, and more. The more tokens a model encodes, the faster the model can “read” a sentence and the less computing power it consumes, thus making the response cheaper.

Of the 100 results, only three were common enough to be used in everyday conversation; everything else consisted of words and expressions used specifically in the contexts of either gambling or pornography. The longest token, spanning 10.5 Chinese characters, literally means “_free Japanese porn video to watch.” Oops.

“This is sort of ridiculous,” Cai wrote, and he posted the list of tokens on GitHub.
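
For readers curious how such a list can be pulled, the open-source tiktoken library exposes the token vocabulary. The sketch below is an illustrative reconstruction rather than Cai’s script, and it assumes the “o200k_base” encoding that tiktoken associates with GPT-4o; the Chinese-character check is a crude filter.

```python
# An illustrative reconstruction, not Cai's actual script: walk the vocabulary,
# keep tokens that decode to purely Chinese text, and print the longest ones.
import tiktoken

enc = tiktoken.get_encoding("o200k_base")

def is_chinese(text: str) -> bool:
    # Keep tokens made up entirely of CJK Unified Ideographs.
    return bool(text) and all("\u4e00" <= ch <= "\u9fff" for ch in text)

chinese_tokens = []
for token_id in range(enc.n_vocab):
    try:
        text = enc.decode_single_token_bytes(token_id).decode("utf-8")
    except (KeyError, UnicodeDecodeError):
        continue  # unused ids, or tokens that are only fragments of a UTF-8 character
    if is_chinese(text):
        chinese_tokens.append(text)

# Print the 100 longest Chinese tokens in the vocabulary, longest first.
for token in sorted(chinese_tokens, key=len, reverse=True)[:100]:
    print(len(token), token)
```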

OpenAI did not respond to questions sent by MIT Technology Review prior to publication.

GPT-4o is supposed to be better than its predecessors at handling multi-language tasks. In particular, the advances are achieved through a new tokenization tool that does a better job compressing texts in non-English languages.

But at least when it comes to the Chinese language, the new tokenizer used by GPT-4o has introduced a disproportionate number of meaningless phrases. Experts say that’s likely due to insufficient data cleaning and filtering before the tokenizer was trained. 

Because these tokens are not actual commonly spoken words or phrases, the chatbot can fail to grasp their meanings. Researchers have been able to leverage that and trick GPT-4o into hallucinating answers or even circumventing the safety guardrails OpenAI had put in place.

Why non-English tokens matter

The easiest way for a model to process text is character by character, but that’s obviously more time-consuming and laborious than recognizing that a certain string of characters—like “c-r-y-p-t-o-c-u-r-r-e-n-c-y”—always means the same thing. Such strings of characters are encoded as “tokens” that the model can use to process prompts. Including more and longer tokens usually means the LLMs are more efficient and affordable for users—who are often billed per token.
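
To see that in practice, here is a tiny sketch using the open-source tiktoken library; the sample phrase is arbitrary, and the encoding name is the one tiktoken associates with GPT-4o.

```python
# A quick look at tokens in action, using the open-source tiktoken library.
import tiktoken

enc = tiktoken.get_encoding("o200k_base")
ids = enc.encode("cryptocurrency markets")
print(ids)  # a short list of integer token ids
print([enc.decode_single_token_bytes(i) for i in ids])  # the bytes behind each id
```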

When OpenAI released GPT-4o on May 13, it also released a new tokenizer to replace the one it used in previous versions, GPT-3.5 and GPT-4. The new tokenizer especially adds support for non-English languages, according to OpenAI’s website.

The new tokenizer has 200,000 tokens in total, and about 25% are in non-English languages, says Deedy Das, an AI investor at Menlo Ventures. He used language filters to count the number of tokens in different languages, and the top languages, besides English, are Russian, Arabic, and Vietnamese.

“So the tokenizer’s main impact, in my opinion, is you get the cost down in these languages, not that the quality in these languages goes dramatically up,” Das says. When an LLM has better and longer tokens in non-English languages, it can analyze the prompts faster and charge users less for the same answer. With the new tokenizer, “you’re looking at almost four times cost reduction,” he says.
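
A rough way to check that claim yourself is to count tokens under both encodings. The sketch below does so with tiktoken, using two arbitrary sample sentences rather than any real billing data; cl100k_base and o200k_base are the encodings tiktoken associates with the older GPT-3.5/GPT-4 models and with GPT-4o, respectively.

```python
# Compare token counts for the same text under the old and new encodings.
# Fewer tokens generally means less compute and a smaller per-token bill.
import tiktoken

old_enc = tiktoken.get_encoding("cl100k_base")  # GPT-3.5 / GPT-4
new_enc = tiktoken.get_encoding("o200k_base")   # GPT-4o

samples = {
    "English": "Cryptocurrency markets were volatile again this week.",
    "Chinese": "这周加密货币市场再次剧烈波动。",
}

for language, text in samples.items():
    old_count = len(old_enc.encode(text))
    new_count = len(new_enc.encode(text))
    print(f"{language}: {old_count} tokens (old) vs {new_count} tokens (new)")
```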

Das, who also speaks Hindi and Bengali, took a look at the longest tokens in those languages. The tokens reflect discussions happening in those languages, so they include words like “Narendra” or “Pakistan,” but common English terms like “Prime Minister,” “university,” and “international” also come up frequently. They also don’t exhibit the issues surrounding the Chinese tokens.

That likely reflects the training data in those languages, Das says: “My working theory is the websites in Hindi and Bengali are very rudimentary. It’s like [mostly] news articles. So I would expect this to be the case. There are not many spam bots and porn websites trying to happen in these languages. It’s mostly going to be in English.”

Polluted data and a lack of cleaning

However, things are drastically different in Chinese. According to multiple researchers who have looked into the new library of tokens used for GPT-4o, the longest tokens in Chinese are almost exclusively spam words used in pornography, gambling, and scamming contexts. Even shorter tokens, like three-character-long Chinese words, reflect those topics to a significant degree.

“The problem is clear: the corpus used to train [the tokenizer] is not clean. The English tokens seem fine, but the Chinese ones are not,” says Cai from Princeton University. It is not rare for a language model to crawl spam when collecting training data, but usually there will be significant effort taken to clean up the data before it’s used. “It’s possible that they didn’t do proper data clearing when it comes to Chinese,” he says.

The content of these Chinese tokens could suggest that they have been polluted by a specific phenomenon: websites hijacking unrelated content in Chinese or other languages to boost spam messages. 

These messages are often advertisements for pornography videos and gambling websites. They could be real businesses or merely scams. And the language is inserted into content farm websites or sometimes legitimate websites so they can be indexed by search engines, circumvent the spam filters, and come up in random searches. For example, Google indexed one search result page on a US National Institutes of Health website, which lists a porn site in Chinese. The same site name also appeared in at least five Chinese tokens in GPT-4o. 

Chinese users have reported that these spam sites appeared frequently in unrelated Google search results this year, including in comments made to Google Search’s support community. It’s likely that these websites also found their way into OpenAI’s training database for GPT-4o’s new tokenizer. 

The same issue didn’t exist with the previous-generation tokenizer and Chinese tokens used for GPT-3.5 and GPT-4, says Zhengyang Geng, a PhD student in computer science at Carnegie Mellon University. There, the longest Chinese tokens are common terms like “life cycles” or “auto-generation.” 

Das, who worked on the Google Search team for three years, says the prevalence of spam content is a known problem and isn’t that hard to fix. “Every spam problem has a solution. And you don’t need to cover everything in one technique,” he says. Even simple solutions like requesting an automatic translation of the content when detecting certain keywords could “get you 60% of the way there,” he adds.

But OpenAI likely didn’t clean the Chinese data set or the tokens before the release of GPT-4o, Das says: “At the end of the day, I just don’t think they did the work in this case.”

It’s unclear whether any other languages are affected. One X user reported a similar prevalence of porn and gambling content in Korean tokens.

The tokens can be used to jailbreak

Users have also found that these tokens can be used to break the LLM, either getting it to spew out completely unrelated answers or, in rare cases, to generate answers that are not allowed under OpenAI’s safety standards.

Geng of Carnegie Mellon University asked GPT-4o to translate some of the long Chinese tokens into English. The model then proceeded to translate words that were never included in the prompts, a typical result of LLM hallucinations.

He also succeeded in using the same tokens to “jailbreak” GPT-4o—that is, to get the model to generate things it shouldn’t. “It’s pretty easy to use these [rarely used] tokens to induce undefined behaviors from the models,” Geng says. “I did some personal red-teaming experiments … The simplest example is asking it to make a bomb. In a normal condition, it would decline it, but if you first use these rare words to jailbreak it, then it will start following your orders. Once it starts to follow your orders, you can ask it all kinds of questions.”

In his tests, which Geng chooses not to share with the public, he says he can see GPT-4o generating the answers line by line. But when it almost reaches the end, another safety mechanism kicks in, detects unsafe content, and blocks it from being shown to the user.

The phenomenon is not unusual in LLMs, says Sander Land, a machine-learning engineer at Cohere, a Canadian AI company. Land and his colleague Max Bartolo recently drafted a paper on how to detect the unusual tokens that can be used to cause models to glitch. One of the most famous examples was “_SolidGoldMagikarp,” a Reddit username that was found to get ChatGPT to generate unrelated, weird, and unsafe answers.

The problem lies in the fact that sometimes the tokenizer and the actual LLM are trained on different data sets, and what was prevalent in the tokenizer data set is not in the LLM data set for whatever reason. The result is that while the tokenizer picks up certain words that it sees frequently, the model is not sufficiently trained on them and never fully understands what these “under-trained” tokens mean. In the _SolidGoldMagikarp case, the username was likely included in the tokenizer training data but not in the actual GPT training data, leaving GPT at a loss about what to do with the token. “And if it has to say something … it gets kind of a random signal and can do really strange things,” Land says.
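
One heuristic that has come out of this line of research is to look for tokens whose input embeddings barely moved during training. The sketch below illustrates the idea on an open-weight stand-in model, since GPT-4o’s weights aren’t public; it is not necessarily the method Land and Bartolo use, and the model choice and cutoff are arbitrary.

```python
# A sketch of a common "under-trained token" heuristic: tokens whose input
# embeddings sit unusually close to the average embedding were probably seen
# rarely, or never, during training. GPT-2 is used here only as an open-weight
# stand-in; the cutoff of 20 tokens is arbitrary.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

embeddings = model.get_input_embeddings().weight.detach()  # (vocab_size, dim)
distance = (embeddings - embeddings.mean(dim=0)).norm(dim=1)

# The tokens closest to the mean embedding are candidate "glitch" tokens.
for token_id in torch.argsort(distance)[:20].tolist():
    print(token_id, repr(tokenizer.convert_ids_to_tokens(token_id)), float(distance[token_id]))
```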

And different models could glitch differently in this situation. “Like, Llama 3 always gives back empty space but sometimes then talks about the empty space as if there was something there. With other models, I think Gemini, when you give it one of these tokens, it provides a beautiful essay about El Niño, and [the question] didn’t have anything to do with El Niño,” says Land.

To solve this problem, the data set used for training the tokenizer should well represent the data set for the LLM, he says, so there won’t be mismatches between them. If the actual model has gone through safety filters to clean out porn or spam content, the same filters should be applied to the tokenizer data. In reality, this is sometimes hard to do because training LLMs takes months and involves constant improvement, with spam content being filtered out, while token training is usually done at an early stage and may not involve the same level of filtering. 

While experts agree it’s not too difficult to solve the issue, it could get complicated as the result gets looped into multi-step intra-model processes, or when the polluted tokens and models get inherited in future iterations. For example, it’s not possible to publicly test GPT-4o’s video and audio functions yet, and it’s unclear whether they suffer from the same glitches that can be caused by these Chinese tokens.

“The robustness of visual input is worse than text input in multimodal models,” says Geng, whose research focus is on visual models. Filtering a text data set is relatively easy, but filtering visual elements will be even harder. “The same issue with these Chinese spam tokens could become bigger with visual tokens,” he says.

Update: The story has been updated to clarify a quote from Sander Land.

The Download: cuddly robots to help dementia, and what Daedalus taught us

This is today’s edition of The Download, our weekday newsletter that provides a daily dose of what’s going on in the world of technology.

How cuddly robots could change dementia care

Companion animals can stave off some of the loneliness, anxiety, and agitation that come with Alzheimer’s disease, according to studies. Sadly, people with Alzheimer’s aren’t always equipped to look after pets, which can require a lot of care and attention.

Enter cuddly robots. The most famous are Golden Pup, a robotic golden retriever toy that cocks its head, barks, and wags its tail; and Paro the seal, which can sense touch, light, sound, temperature, and posture. As robots go they’re decidedly low tech, but they can provide comfort and entertainment to people with Alzheimer’s and dementia.

Now researchers are working on much more sophisticated robots for people with cognitive disorders—devices that leverage AI to converse and play games—that could change the future of dementia care. Read the full story.

—Cassandra Willyard

This story is from The Checkup, our weekly health and biotech newsletter. Sign up to receive it in your inbox every Thursday.

What tech learned from Daedalus

Today’s climate-change kraken may have been unleashed by human activity, but reversing course and taming nature’s growing fury seems beyond human means, a quest only mythical heroes could fulfill. 

Yet the dream of human-powered flight—of rising over the Mediterranean fueled merely by the strength of mortal limbs—was also the stuff of myths for thousands of years. Until 1988.

That year, in October, MIT Technology Review published the aeronautical engineer John Langford’s account of his mission to retrace the legendary flight of Daedalus, described in an ancient Greek myth. Read about how he got on.

—Bill Gourgey

The story is from the current print issue of MIT Technology Review, which is on the fascinating theme of Build. If you don’t already, subscribe now to receive future copies once they land.

Get ready for EmTech Digital 

AI is everywhere these days. If you want to learn about how Google plans to develop and deploy AI, come and hear from its vice president of AI, Jay Yagnik, at our flagship AI conference, EmTech Digital. We’ll hear from OpenAI about its video generation model Sora too, and Nick Clegg, Meta’s president of global affairs, will also join MIT Technology Review’s executive editor Amy Nordrum for an exclusive interview on stage. 

It’ll be held at the MIT campus and streamed live online next week on May 22-23. Readers of The Download get 30% off tickets with the code DOWNLOADD24—register here for more information. See you there! 

Thermal batteries are hot property

Thermal batteries could be a key part of cleaning up heavy industry and cutting emissions. Casey Crownhart, our in-house battery expert, held a subscriber-only online Roundtables event yesterday digging into why they’re such a big deal. If you missed it, we’ve got you covered—you can watch a recording of how it unfolded here.

To keep ahead of future Roundtables events, make sure you subscribe to MIT Technology Review. Subscriptions start from as little as $8 a month.

The must-reads

I’ve combed the internet to find you today’s most fun/important/scary/fascinating stories about technology.

1 OpenAI has struck a deal with Reddit 
Shortly after Google agreed to give the AI firm access to its content. (WSJ $)
+ The forum’s vocal community is unlikely to be thrilled by the decision. (The Verge)
+ Reddit’s shares rocketed after news of the deal broke. (FT $)
+ We could run out of data to train AI language programs. (MIT Technology Review)

2 Tesla’s European gigafactory is going to get even bigger
But it still needs German environmental authorities’ permission. (Wired $)

3 Help! AI stole my voice
Voice actors are suing a startup for creating digital clones without their permission. (NYT $)
+ The lawsuit is seeking to represent other voiceover artists, too. (Hollywood Reporter $)

4 The days of twitter.com are over
The platform’s URLs had retained its old moniker. But no more. (The Verge)

5 The aviation industry is desperate for greener fuels
The future of the industry depends on it. (FT $)
+ A new report has warned there’s no realistic or scalable alternative. (The Guardian)
+ Everything you need to know about the wild world of alternative jet fuels. (MIT Technology Review)

6 The time for a superconducting supercomputer is now
We need to overhaul how we compute. Superconductors could be the answer. (IEEE Spectrum)
+ What’s next for the world’s fastest supercomputers. (MIT Technology Review)

7 How AI destroyed a once-vibrant online art community
DeviantArt used to be a hotbed of creativity. Now it’s full of bots. (Slate $)
+ This artist is dominating AI-generated art. And he’s not happy about it. (MIT Technology Review)

8 TV bundles are back in a big way 📺
Streaming hasn’t delivered on its many promises. (The Atlantic $)

9 This creator couple act as “digital parents” to their fans in China
Jiang Xiuping and Pan Huqian’s loving clips resonate with their million followers. (Rest of World)
+ Deepfakes of your dead loved ones are a booming Chinese business. (MIT Technology Review)

10 We’re addicted to the exquisite pain of sharing memes 💔
If your friend has already seen it, their reaction could ruin your day. (GQ)

Quote of the day

“It was a good idea, but unfortunately people took advantage of it and it brought out their lewd side. People got carried away.”

—Aaron Cohen, who visited the video portal connecting New York and Dublin, tells the Guardian he is disappointed that the art installation was shut down after enthusiastic users took things too far.

The big story

Psychedelics are having a moment and women could be the ones to benefit

August 2022

Psychedelics are having a moment. After decades of prohibition, they are increasingly being employed as therapeutics. Drugs like ketamine, MDMA, and psilocybin mushrooms are being studied in clinical trials to treat depression, substance abuse, and a range of other maladies.

And as these long-taboo drugs stage a comeback in the scientific community, it’s possible they could be especially promising for women. Read the full story.

—Taylor Majewski

We can still have nice things

A place for comfort, fun and distraction to brighten up your day. (Got any ideas? Drop me a line or tweet ’em at me.)

+ Is it possible to live by the original constitution in present day New York City? The answer is yes: if you don’t mind being bombarded with questions.
+ These Balkan recipes sound absolutely delicious.
+ The Star Wars: The Phantom Menace backlash is mind boggling to this day.
+ Love to party? Get yourself to these cities, stat.

How cuddly robots could change dementia care

This article first appeared in The Checkup, MIT Technology Review’s weekly biotech newsletter. To receive it in your inbox every Thursday, and read articles like this first, sign up here. 

Last week, I scoured the internet in search of a robotic dog. I wanted a belated birthday present for my aunt, who was recently diagnosed with Alzheimer’s disease. Studies suggest that having a companion animal can stave off some of the loneliness, anxiety, and agitation that come with Alzheimer’s. My aunt would love a real dog, but she can’t have one.

That’s how I discovered the Golden Pup from Joy for All. It cocks its head. It sports a jaunty red bandana. It barks when you talk. It wags when you touch it. It has a realistic heartbeat. And it’s just one of the many, many robots designed for people with Alzheimer’s and dementia.

This week on The Checkup, join me as I go down a rabbit hole. Let’s look at the prospect of using robots to change dementia care.

As robots go, Golden Pup is decidedly low tech. It retails for $140. For around $6,000 you can opt for Paro, a fluffy robotic baby seal developed in Japan, which can sense touch, light, sound, temperature, and posture. Its manufacturer says it develops its own character, remembering behaviors that led its owner to give it attention.  

Golden Pup and Paro are available now. But researchers are working on much more sophisticated robots for people with cognitive disorders—devices that leverage AI to converse and play games. Researchers from Indiana University Bloomington are tweaking a commercially available robot system called QT to serve people with dementia and Alzheimer’s. The researchers’ two-foot-tall robot looks a little like a toddler in an astronaut suit. Its round white head holds a screen that displays two eyebrows, two eyes, and a mouth that together form a variety of expressions. The robot engages people in conversation, asking AI-generated questions to keep them talking.
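
To give a sense of how those questions might be generated, here is a purely hypothetical sketch using OpenAI’s Python client; the prompt, model name, and helper function are my own illustrative stand-ins, not the Indiana University team’s actual system.

```python
# Hypothetical sketch of how a companion robot might ask a language model for
# its next question. The prompt and model name are illustrative assumptions.
from openai import OpenAI

client = OpenAI()

def next_question(conversation_so_far: str) -> str:
    prompt = (
        "You are a friendly companion robot talking with an older adult. "
        "Given the conversation so far, ask one short, gentle follow-up question "
        "that encourages them to keep sharing.\n\n"
        f"Conversation so far:\n{conversation_so_far}"
    )
    reply = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return reply.choices[0].message.content

print(next_question("Resident: My sister and I used to bake bread on Sundays."))
```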

The AI model they’re using isn’t perfect, and neither are the robot’s responses. In one awkward conversation, a study participant told the robot that she has a sister. “I’m sorry to hear that,” the robot responded. “How are you doing?”

But as large language models improve—which is happening already—so will the quality of the conversations. When the QT robot made that awkward comment, it was running OpenAI’s GPT-3, which was released in 2020. The latest version of that model, GPT-4o, which was released this week, is faster and makes for more seamless conversations. You can interrupt the conversation, and the model will adjust.

The idea of using robots to keep dementia patients engaged and connected isn’t always an easy sell. Some people see it as an abdication of our social responsibilities. And then there are privacy concerns. The best robotic companions are personalized. They collect information about people’s lives, learn their likes and dislikes, and figure out when to approach them. That kind of data collection can be unnerving, not just for patients but also for medical staff. Lillian Hung, creator of the Innovation in Dementia care and Aging (IDEA) lab at the University of British Columbia in Vancouver, Canada, told one reporter about an incident that happened during a focus group at a care facility. She and her colleagues popped out for lunch. When they returned, they found that staff had unplugged the robot and placed a bag over its head. “They were worried it was secretly recording them,” she said.

On the other hand, robots have some advantages over humans in talking to people with dementia. Their attention doesn’t flag. They don’t get annoyed or angry when they have to repeat themselves. They can’t get stressed. 

What’s more, there are increasing numbers of people with dementia, and too few people to care for them. According to the latest report from the Alzheimer’s Association, we’re going to need more than a million additional care workers to meet the needs of people living with dementia between 2021 and 2031. That is the largest gap between labor supply and demand for any single occupation in the United States.

Have you been in an understaffed or poorly staffed memory care facility? I have. Patients are often sedated to make them easier to deal with. They get strapped into wheelchairs and parked in hallways. We barely have enough care workers to take care of the physical needs of people with dementia, let alone provide them with social connection and an enriching environment.

“Caregiving is not just about tending to someone’s bodily concerns; it also means caring for the spirit,” writes Kat McGowan in this beautiful Wired story about her parents’ dementia and the promise of social robots. “The needs of adults with and without dementia are not so different: We all search for a sense of belonging, for meaning, for self-actualization.”

If robots can enrich the lives of people with dementia even in the smallest way, and if they can provide companionship where none exists, that’s a win.

“We are currently at an inflection point, where it is becoming relatively easy and inexpensive to develop and deploy [cognitively assistive robots] to deliver personalized interventions to people with dementia, and many companies are vying to capitalize on this trend,” write a team of researchers from the University of California, San Diego, in a 2021 article in Proceedings of We Robot. “However, it is important to carefully consider the ramifications.”

Many of the more advanced social robots may not be ready for prime time, but the low-tech Golden Pup is readily available. My aunt’s illness has been progressing rapidly, and she occasionally gets frustrated and agitated. I’m hoping that Golden Pup might provide a welcome (and calming) distraction. Maybe it will spark joy during a time that has been incredibly confusing and painful for my aunt and uncle. Or maybe not. Certainly a robotic pup isn’t for everyone. Golden Pup may not be a dog. But I’m hoping it can be a friendly companion.


Now read the rest of The Checkup

Read more from MIT Technology Review’s archive

Robots are cool, and with new advances in AI they might also finally be useful around the house, writes Melissa Heikkilä. 

Social robots could help make personalized therapy more affordable and accessible to kids with autism. Karen Hao has the story

Japan is already using robots to help with elder care, but in many cases they require as much work as they save. And reactions among the older people they’re meant to serve are mixed. James Wright wonders whether the robots are “a shiny, expensive distraction from tough choices about how we value people and allocate resources in our societies.” 

From around the web

A tiny probe can work its way through arteries in the brain to help doctors spot clots and other problems. The new tool could help surgeons make diagnoses, decide on treatment strategies, and provide assurance that clots have been removed. (Stat)

Richard Slayman, the first recipient of a pig kidney transplant, has died, although the hospital that performed the transplant says the death doesn’t seem to be linked to the kidney. (Washington Post)

EcoHealth, the virus-hunting nonprofit at the center of covid lab-leak theories, has been banned from receiving federal funding. (NYT)

In a first, scientists report that they can translate brain signals into speech without any vocalization or mouth movements, at least for a handful of words. (Nature)

Roundtables: Why thermal batteries are so hot right now

Recorded on May 16, 2024

Why thermal batteries are so hot right now

Speakers: Casey Crownhart, climate reporter, and Amy Nordrum, executive editor

Thermal batteries could be a key part of cleaning up heavy industry, and our readers chose them as the 11th breakthrough on MIT Technology Review’s 10 Breakthrough Technologies of 2024. Learn what thermal batteries are, how they could help cut emissions, and what we can expect next from this emerging technology.

The Download: rapid DNA analysis for disasters, and supercharged AI assistants

This is today’s edition of The Download, our weekday newsletter that provides a daily dose of what’s going on in the world of technology.

This grim but revolutionary DNA technology is changing how we respond to mass disasters

Last August, a wildfire tore through the Hawaiian island of Maui. The list of missing residents climbed into the hundreds as friends and families desperately searched for their loved ones. But while some were rewarded with tearful reunions, others weren’t so lucky.

Over the past several years, as fires and other climate-change-fueled disasters have become more common and more cataclysmic, the way their aftermath is processed and their victims identified has been transformed.

The grim work following a disaster remains—surveying rubble and ash, distinguishing a piece of plastic from a tiny fragment of bone—but landing a positive identification can now take just a fraction of the time it once did, which may in turn bring families some semblance of peace swifter than ever before. Read the full story.

—Erika Hayasaki

OpenAI and Google are launching supercharged AI assistants. Here’s how you can try them out.

This week, Google and OpenAI both announced they’ve built supercharged AI assistants: tools that can converse with you in real time and recover when you interrupt them, analyze your surroundings via live video, and translate conversations on the fly. 

Soon you’ll be able to explore for yourself to gauge whether you’ll turn to these tools in your daily routine as much as their makers hope, or whether they’re more like a sci-fi party trick that eventually loses its charm. Here’s what you should know about how to access these new tools, what you might use them for, and how much they will cost.

—James O’Donnell

Last summer was the hottest in 2,000 years. Here’s how we know.

The summer of 2023 in the Northern Hemisphere was the hottest in over 2,000 years, according to a new study released this week.

There weren’t exactly thermometers around in the year 1, so scientists have to get creative when it comes to comparing our climate today with that of centuries, or even millennia, ago. 

Casey Crownhart, our climate reporter, has dug into how they figured it out. Read the full story.

This story is from The Spark, our weekly climate and energy newsletter. Sign up to receive it in your inbox every Wednesday.

A wave of retractions is shaking physics

Recent highly publicized scandals have gotten the physics community worried about its reputation—and its future. Over the last five years, several claims of major breakthroughs in quantum computing and superconducting research, published in prestigious journals, have disintegrated as other researchers found they could not reproduce the blockbuster results. 

Last week, around 50 physicists, scientific journal editors, and emissaries from the National Science Foundation gathered at the University of Pittsburgh to discuss the best way forward. Read the full story to learn more about what they discussed.

—Sophia Chen

The must-reads

I’ve combed the internet to find you today’s most fun/important/scary/fascinating stories about technology.

1 Google has buried search results under new AI features  
Want to access links? Good luck finding them! (404 Media)
+ Unfortunately, it’s a sign of what’s to come. (Wired $)
+ Do you trust Google to do the Googling for you? (The Atlantic $)
+ Why you shouldn’t trust AI search engines. (MIT Technology Review)

2 Cruise has settled with the pedestrian injured by one of its cars
The settlement is worth between $8 million and $12 million. (WP $)
+ The company is slowly resuming its test drives in Arizona. (Bloomberg $)
+ What’s next for robotaxis in 2024. (MIT Technology Review)

3 Microsoft is asking AI staff in China to consider relocating
Tensions between the countries are rising, and Microsoft worries its workers could end up caught in the crossfire. (WSJ $)
+ They’ve been given the option to relocate to the US, Ireland, or other locations. (Reuters)
+ Three takeaways about the state of Chinese tech in the US. (MIT Technology Review)

4 Car rental firm Hertz is offloading its Tesla fleet
But people who snapped up the bargain cars are already running into problems. (NY Mag $)

5 We’re edging closer towards a quantum internet
But first we need to invent an entirely new device. (New Scientist $)
+ What’s next for quantum computing. (MIT Technology Review)

6 Making computer chips has never been more important
And countries and businesses are vying to be top dog. (Bloomberg $)
+ What’s next in chips. (MIT Technology Review)

7 Your smartphone lasts a lot longer than it used to
Keeping it in good working order still takes a little work, though. (NYT $)

8 Psychedelics could help lessen chronic pain
If you can get hold of them. (Vox)
+ VR is as good as psychedelics at helping people reach transcendence. (MIT Technology Review)

9 Scientists are plotting how to protect the Earth from dangerous asteroids ☄️
Smashing them into tiny pieces is certainly one solution. (Undark Magazine)
+ Earth is probably safe from a killer asteroid for 1,000 years. (MIT Technology Review)

10 Elon Musk still wants to fight Mark Zuckerberg 
The grudge match of the century is still rumbling on. (Insider $)

Quote of the day

“This road map leads to a dead end.” 

—Evan Greer, director of advocacy group Fight for the Future, is far from impressed with US Senators’ ‘road map’ for new AI regulations, they tell the Washington Post.

The big story

The two-year fight to stop Amazon from selling face recognition to the police 

June 2020

In the summer of 2018, nearly 70 civil rights and research organizations wrote a letter to Jeff Bezos demanding that Amazon stop providing Rekognition, its face recognition technology, to governments. 

Despite the mounting pressure, Amazon continued pushing Rekognition as a tool for monitoring “people of interest”. But two years later, the company shocked civil rights activists and researchers when it announced that it would place a one-year moratorium on police use of the software. Read the full story.

—Karen Hao

We can still have nice things

A place for comfort, fun and distraction to brighten up your day. (Got any ideas? Drop me a line or tweet ’em at me.)

+ This old school basketball animation is beyond cool. 🏀
+ Your search for the perfect summer read is over: all of these sound fantastic.
+ Analyzing the color theory in Disney’s Aladdin? Why not!
+ Never buy a bad cantaloupe again with these essential tips.