
There is a vast hidden workforce behind AI
WHEN DEEPSEEK, a hotshot Chinese firm, released its cheap large language model late last year, it overturned long-standing assumptions about what it will take to build the next generation of artificial intelligence (AI). That will matter to whoever comes out on top in the epic global battle for AI supremacy. Developers are now reconsidering how much hardware, energy and data are needed. Yet another, less discussed, input into machine intelligence is in flux too: the workforce.
To the layman, AI is all robots, machines and models. It is a technology that kills jobs. In fact, there are millions of workers involved in producing AI models. Much of their work has involved tasks like tagging objects in images of roads in order to train self-driving cars and labelling words in the audio recordings used to train speech-recognition systems. Technically, annotators give data the contextual information computers need to work out the statistical associations between components of a dataset and their meaning to human beings. Indeed, anyone who has completed a CAPTCHA test, selecting photos containing zebra crossings, may have inadvertently helped train an AI.
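For readers who want a concrete picture, here is a minimal sketch of what one image-annotation record might look like. The field names are illustrative, loosely modelled on common formats such as COCO; real pipelines vary by vendor and task.

```python
# One annotated image: a human has drawn boxes around objects and
# named them, giving the raw pixels their meaning to people.
annotation = {
    "image": "road_0042.jpg",
    "labels": [
        {"category": "pedestrian", "bbox": [412, 188, 64, 130]},      # x, y, w, h
        {"category": "zebra_crossing", "bbox": [120, 300, 480, 90]},
    ],
    "annotator_id": "worker_517",  # hypothetical identifier
}

# A trainer pairs each image with its labels so a model can learn the
# statistical association between pixel patterns and human categories.
categories = {label["category"] for label in annotation["labels"]}
```

Millions of such records, each touched by a human hand, make up the datasets behind systems like self-driving cars.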
This is the "unsexy" part of the industry, as Alex Wang, the boss of Scale AI, a data firm, puts it. Although Scale AI says most of its contributor work happens in America and Europe, across the industry much of the labour is outsourced to poor parts of the world, where lots of educated people are looking for work. The Chinese government has teamed up with tech companies, such as Alibaba and JD.com, to bring annotation jobs to far-flung parts of the country. In India the IT industry body, Nasscom, reckons annotation revenues could reach $7bn a year and employ 1m people there by 2030. That is significant, since India's entire IT industry is worth $254bn a year (including hardware) and employs 5.5m people.
Annotators have long been compared to parents, teaching models and helping them make sense of the world. But the latest models don't need their guidance in the same way. As the technology grows up, are its teachers becoming redundant?
Data annotation is not new. Fei-Fei Li, an American computer scientist known as "the godmother of AI", is credited with firing the industry's starting gun in the mid-2000s when she created ImageNet, the largest image dataset at the time. Ms Li realised that if she paid college students to categorise the images, which was then how most researchers did things, the task would take 90 years. Instead, she hired workers around the world using Mechanical Turk, an online gig-work platform run by Amazon. She got some 3.2m images organised into a dataset in two and a half years. Soon other AI labs were outsourcing annotation work this way, too.
Over time developers got fed up with the low-quality annotation done by untrained workers on gig-work sites. AI-data firms, such as Sama and iMerit, emerged. They hired workers across the poor world. Informal annotation work continued but specialist platforms emerged for AI work, like those run by Scale AI, which tests and trains workers. The World Bank reckons that between 4.4% and 12.4% of the global workforce is involved in gig work, including annotation for AI. Krystal Kauffman, a Michigan resident who has been doing data work online for a decade, reckons that tech companies have an interest in keeping this workforce hidden. "They are selling magic—this idea that all these things happen by themselves," Ms Kauffman says. "Without the magic part of it, AI is just another product."
One debate in the industry concerns the treatment of the workers behind AI. Firms are reluctant to share information on wages. But American annotators generally consider $10-20 per hour to be decent pay on online platforms. Those in poor countries often get $4-8 per hour. Many must use monitoring tools that track their computer activity and are penalised for being slow. Scale AI has been hit with several lawsuits over its employment practices. The firm denies wrongdoing and says: "We plan to defend ourselves vigorously."
The bigger issue, though, is that basic annotation work is drying up. In part, this was inevitable. If AI was once a toddler who needed a parent to point things out and to help it make sense of the world around it, the technology has grown into an adolescent who needs occasional specialist guidance and advice. AI labs increasingly use pre-labelled data from other AI labs, which use algorithms to apply labels to datasets.
Take the example of self-driving tractors developed by Blue River Technology, a subsidiary of John Deere, an agricultural-equipment giant. Three years ago the group's engineers in America would upload pictures of farmland into the cloud and provide iMerit staff in Hubli, India, with careful instructions on what to label: tractors, buildings, irrigation equipment. Now the developers use pre-labelled data. They still need iMerit staff to check that labelling and to deal with "edge cases", for example where a dust cloud obscures part of the landscape or a tree throws shade over crops, confusing the model. A process that took months now takes weeks.
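The "pre-label, then human review" workflow described above can be sketched as a simple triage: an upstream model proposes labels with confidence scores, and only uncertain cases (dust clouds, shadows) are routed to annotators. The threshold and field names here are illustrative assumptions, not Blue River's or iMerit's actual pipeline.

```python
# Machine-generated labels, each with the model's confidence.
auto_labels = [
    {"image": "field_001.jpg", "label": "tractor", "confidence": 0.97},
    {"image": "field_002.jpg", "label": "irrigation", "confidence": 0.55},  # obscured by dust
    {"image": "field_003.jpg", "label": "building", "confidence": 0.91},
]

REVIEW_THRESHOLD = 0.80  # assumed cut-off for sending work to humans

# Low-confidence edge cases go to annotators; the rest are accepted as-is.
needs_human_review = [r for r in auto_labels if r["confidence"] < REVIEW_THRESHOLD]
auto_accepted = [r for r in auto_labels if r["confidence"] >= REVIEW_THRESHOLD]
```

Because humans only touch the hard minority of cases, a labelling job that once took months can shrink to weeks.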
From baby steps
The most recent wave of AI models has changed data work more dramatically. Since 2022, when OpenAI first let the public play with its ChatGPT chatbot, there has been a rush of interest in large language models. Data from PitchBook, a research firm, suggest that global venture-capital funding for AI startups jumped by more than 50% in 2024 to $131.5bn, even as funding for other startups fell. Much of it is going into newer techniques for developing AI, which do not need data annotated in the same way. Iva Gumnishka at Humans in the Loop, a social enterprise, says firms doing low-skilled annotation for older computer-vision and natural-language-processing clients are being "left behind".
There is still demand for annotators, but their work has changed. As businesses start to deploy AI, they are building smaller specialised models and looking for highly educated annotators to help. It has become fairly common for adverts for annotation jobs to require a PhD or skills in coding and science. Now that researchers are trying to make AI more multilingual, demand for annotators who speak languages other than English is growing, too. Sushovan Das, a dentist working on medical-AI projects at iMerit, reckons that annotation work will never disappear. "This world is constantly evolving," he says. "So the AI needs to be improved time and again."
New roles for humans in training AI are emerging. Epoch AI, a research firm, reckons the stock of high-quality text available for training may be exhausted by 2026. Some AI labs are hiring people to write chunks of text and lines of code that models can be trained on. Others are buying synthetic data, created using computer algorithms, and hiring humans to verify it. "Synthetic data still needs to be good data," says Wendy Gonzalez, the boss of Sama, which has operations in east Africa.
The other role for workers is in evaluating the output from models and helping to hammer it into shape. That is what got ChatGPT to perform better than previous chatbots. Xiaote Zhu at Scale AI provides an example of the sort of open-ended tasks being done on the firm's Outlier platform, which was launched in 2023 to facilitate the training of AI by experts. Workers are presented with two responses from a chatbot recommending an itinerary for a holiday to the Maldives. They need to select which response they prefer, rate it, explain why the answer is good or bad and then rewrite the response to improve it.
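The evaluation task Ms Zhu describes produces what researchers call preference data, which can then train a reward model. Here is an illustrative sketch of what one such record might contain; the schema is an assumption for exposition, not Scale AI's actual format.

```python
# One worker's judgment on a pair of chatbot responses.
preference_record = {
    "prompt": "Suggest an itinerary for a holiday in the Maldives.",
    "response_a": "Day 1: arrive and rest. Day 2: snorkelling trip to a reef...",
    "response_b": "The Maldives has beaches.",
    "chosen": "a",                                   # which response the worker preferred
    "rating": 4,                                     # e.g. on a 1-5 scale
    "rationale": "Response A is specific and actionable.",
    "rewrite": "Day 1: arrive, check in, sunset walk. Day 2: reef snorkelling...",
}

# Records like this become training pairs for a reward model: the chosen
# response is the positive example, the other the negative.
chosen_key = "response_" + preference_record["chosen"]
rejected_key = "response_b" if preference_record["chosen"] == "a" else "response_a"
chosen_text = preference_record[chosen_key]
rejected_text = preference_record[rejected_key]
```

Aggregated over many workers and prompts, such comparisons are what taught ChatGPT to outperform earlier chatbots.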
Ms Zhu's example is a fairly anodyne one. Yet human feedback is also crucial to making sure AI is safe and ethical. In a document that was published after the launch of ChatGPT in 2022, OpenAI said it had hired experts to "qualitatively probe, adversarially test and generally provide feedback" on its models. At the end of that process the model refused to respond to certain prompts, such as requests to write social-media content aimed at persuading people to join al-Qaeda, a terrorist group.
Flying the nest
If AI developers had their way they would not need this sort of human input at all. Studies suggest that as much as 80% of the time that goes into the development of AI is spent on data work. Naveen Rao at Databricks, an AI firm, says he would like models to teach themselves, just as he would like his own children to do. "I want to build self-efficacious humans," he says. "I want them to have their own curiosity and figure out how to solve problems. I don't want to spoon-feed them every step of the way."
There is a lot of excitement about unsupervised learning, which involves feeding models unlabelled data, and reinforcement learning, which uses trial and error to improve decision-making. AI firms, including Google DeepMind, have trained machines to win at games like Go and chess by playing millions of contests against themselves and tracking which strategies work, without any human input at all. But that self-taught approach doesn't work outside the realms of maths and science, at least for the moment.
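The trial-and-error idea behind reinforcement learning can be shown with a toy: the "game" here is guessing a number, and moves that win are reinforced while losing moves are not. This is a deliberately simplified illustration, not DeepMind's self-play method.

```python
import random

random.seed(0)  # fixed seed so the run is reproducible

TARGET = 7                                # the winning move, unknown to the "agent"
wins = {move: 0 for move in range(10)}    # reward tally per candidate move

for _ in range(1000):
    move = random.randrange(10)   # explore: try moves at random
    if move == TARGET:            # the environment rewards only winning moves
        wins[move] += 1           # reinforce: credit the strategy that worked

# After many trials the agent's best-scoring move is the winning one,
# with no human ever labelling a single example.
best_move = max(wins, key=wins.get)
```

The catch, as the article notes, is that this works only when the reward is crystal clear; open-ended language tasks have no such scoreboard, which is why human feedback persists.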
Tech nerds everywhere have been blown away by how cheap and efficient DeepSeek's model is. But they are less impressed by DeepSeek's attempt to train AI using feedback generated by computers rather than humans. The model struggled to answer open-ended questions, producing gobbledygook in a mixture of languages. "The difference is that with Go and chess the desired outcome is crystal clear: win the game," says Phelim Bradley, co-founder of Prolific, another AI-data firm. "Large language models are more complex and far-reaching, so humans are going to remain in the loop for a long time."
Mr Bradley, like many techies, reckons that more people will need to get involved in training AI, not fewer. Diversity in the workforce matters. When ChatGPT was released a few years ago, people noticed that it overused the word "delve". The word became seen as "AI-ese", a telltale sign that the text was written by a bot. In fact, annotators in Africa had been hired to train the model and the word "delve" is more commonly used in African English than it is in American or British English. In the same way as workers' skills and knowledge are transferred to models, their vocabulary is, too. As it turns out, it takes more than just a village to raise a child.
Clarification: This article has been amended to reflect Scale AI's claim that most of its labour is based in America and Europe.
