
To Create Value with AI, Improve the Quality of Your Unstructured Data
A company's content lies largely in 'unstructured data': the emails, contracts, forms, SharePoint files, meeting recordings, and so forth created through everyday work processes. That proprietary content makes gen AI more distinctive, more knowledgeable about your products and services, less likely to hallucinate, and more likely to deliver economic value. As a chief data officer we interviewed pointed out, 'You're unlikely to get much return on your investment by simply installing Copilot.'
Many companies have concluded that the greatest value from gen AI lies in combining the astounding language, reasoning, and general-knowledge capabilities of large language models (LLMs) with their own proprietary content. That combination is necessary, for example, in enterprise-level gen AI applications in customer service, marketing, legal, and software development, as well as in product and service offerings for customers.
The most common approach by far to adding a company's own content is 'retrieval augmented generation,' or RAG, which combines traditional information retrieval tools, such as search over a database of documents, with the language generation of LLMs. RAG is used because submitting vast quantities of content in a prompt is often technically infeasible or expensive. While technically complex, the approach is quite feasible and yields accurate responses to user prompts, provided the unstructured data it draws on is of high quality. Therein lies the problem. Unstructured data is frequently of poor quality: obsolete, duplicative, inaccurate, and poorly structured, among other problems.
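To make the mechanics concrete, here is a minimal sketch of the RAG pattern: retrieve the documents most relevant to a question, then ground the LLM's answer in them. The embed() and generate() helpers are hypothetical stand-ins for whatever embedding model and LLM endpoint a company uses; this illustrates the pattern, not any particular vendor's implementation.

```python
# A minimal sketch of the RAG pattern. embed() and generate() are hypothetical
# stand-ins for an embedding model and an LLM endpoint, passed in by the caller.
from dataclasses import dataclass

@dataclass
class Doc:
    doc_id: str
    text: str
    embedding: list[float]  # precomputed by embed() at indexing time

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = (sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5)
    return dot / norm if norm else 0.0

def answer(question: str, index: list[Doc], embed, generate, k: int = 3) -> str:
    """Retrieve the k most relevant documents, then ground the LLM in them."""
    q_vec = embed(question)
    top = sorted(index, key=lambda d: cosine(q_vec, d.embedding), reverse=True)[:k]
    context = "\n\n".join(f"[{d.doc_id}] {d.text}" for d in top)
    prompt = (
        "Answer using ONLY the context below. If the answer is not in the "
        f"context, say so.\n\nContext:\n{context}\n\nQuestion: {question}"
    )
    return generate(prompt)
```

Note that the quality of the answer is bounded by the quality of the documents in the index, which is the central argument of this article.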
Most companies have not done well with the quality of structured data, even though that data is used every day to complete business transactions and understand performance. Unstructured data is tougher still. The last serious attempts to address it date to the 1990s and 2000s, when knowledge management was popular, and most of those efforts proved unsuccessful. Surveys confirm that most leaders are aware that poor quality hampers their generative AI efforts, and that they had no strong focus on unstructured data until the advent of gen AI.
Of course, the best way to deal with data quality problems is to prevent them. Over the long term, companies serious about AI must develop programs to do just that. Those who create documents, for example, need to learn to evaluate them for quality and tag key elements. But this will take much concerted effort and is no help in the short term. To get value from gen AI now, companies need to build RAG applications using high-quality unstructured data. Our objective in this article is to help them do so by summarizing the most important data problems and the best approaches, both human and technical, for dealing with them.
What Is Data Quality for Unstructured Data?
High-quality data, whether structured or unstructured, results only from focused effort: active, engaged leadership, some well-placed professionals, clear management responsibilities for all who touch data, and a relentless commitment to continuous improvement. Absent these things, chances are high your data is not up to snuff. As coach and advisor Alex Borek of the Data Masterclass told us, 'When AI doesn't work, it often reveals flaws in the human system.'
Indeed, the best estimate is that 80% of the time spent on an AI project will be devoted to data. For example, a Philippines-based Morgan Stanley team spent several years curating research reports in advance of the firm's AI @ Morgan Stanley Assistant project. The curation started before gen AI became widespread, which allowed Morgan Stanley to get its application into production more quickly.
To work effectively, RAG requires documents directly relevant to the problem at hand, a minimum of duplicated content, and information in those documents that is complete, accurate, and up to date. Further, as Seth Earley of Earley Information Science noted, 'You must supply context, as much as possible, if an LLM is to properly interpret these documents.'
Unstructured data does not come preloaded with the needed context, and gen AI is largely incapable of determining what information is best for solving a particular business question or issue. It is also not good at 'entity resolution': Is the 'John Smith' in document A, about customers, the same person as the 'J. A. Smith' in document B, about vendors, or the 'Mr. J Smith' in document C, about a donation to our foundation?
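A toy example shows why this is hard even before an LLM gets involved: naive name normalization cannot distinguish one person from three. The names and file names below are illustrative.

```python
# A toy illustration of the entity-resolution problem. Naive normalization
# cleans up names but cannot decide whether they refer to the same person;
# production systems add addresses, fuzzy matching, and human review.
import re

def normalize(name: str) -> str:
    """Lowercase, strip honorifics and punctuation, collapse whitespace."""
    name = re.sub(r"\b(mr|mrs|ms|dr)\.?\s+", "", name.lower())
    name = re.sub(r"[^a-z ]", " ", name)
    return " ".join(name.split())

records = [
    ("customers.docx", "John Smith"),
    ("vendors.docx", "J. A. Smith"),
    ("foundation.docx", "Mr. J Smith"),
]

for source, raw in records:
    print(source, "->", normalize(raw))
# Yields 'john smith', 'j a smith', and 'j smith': normalization alone cannot
# tell whether these are one person or three, which is why entity resolution
# needs richer context than the documents themselves supply.
```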
Most structured data is defined in a data model or dictionary. That provides some context and helps reduce the John Smith/J. A. Smith problem described above; it also makes it easier to find the data you want, learn who is responsible for it, and understand what it means. As John Duncan, the head of data governance for the large car retailer CarMax, told us, unstructured data requires the same clarity about data ownership, producers, consumers, and stewards. It also benefits from standards for data quality thresholds, data lineage, access controls, and retention periods. This metadata is typically recorded in a data dictionary.
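As a concrete illustration, here is what a dictionary entry for one unstructured content type might capture, in a Python sketch. The field names and values are our assumptions for illustration, not a standard schema or CarMax's actual dictionary.

```python
# A sketch of a data-dictionary entry for an unstructured content type,
# covering the ownership, quality, lineage, access, and retention metadata
# discussed above. All fields and values are illustrative assumptions.
supplier_contract_entry = {
    "content_type": "supplier_contract",
    "definition": "Executed agreement between the company and a supplier",
    "owner": "Procurement",
    "producers": ["Legal", "Procurement"],
    "consumers": ["Finance", "Risk", "Gen AI supplier-risk assistant"],
    "steward": "contracts-data-steward@example.com",  # hypothetical address
    "quality_thresholds": {"completeness": 0.95, "max_age_days": 365},
    "lineage": "e-signature system -> contracts library -> RAG index",
    "access_controls": ["procurement-staff", "legal-staff"],
    "retention": "7 years after contract end",
}
```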
With unstructured data, however, such a dictionary seldom exists. Often there is no centralized management of content at all; documents are stored haphazardly, under different naming conventions, on different computers or cloud services across the company. There is often no common definition of a content type: an ad agency data leader confessed that there was no common definition of a 'pitch' across the agency. Finally, unstructured documents were usually created for a purpose other than feeding gen AI. A contract with a supplier, for example, was not designed to provide insight into the level of risk in the supplier relationship. We believe it was the late management thinker Charles Handy who observed, 'Information gathered for one purpose is seldom useful for another.'
An Unstructured Data Quality Process
Fortunately, there are several approaches and tools that can help to improve unstructured data. We recommend that all AI projects follow a disciplined process, building quality in wherever they can. Such a process must embrace the following steps:
Address unstructured data quality issues problem by problem, not all at once.
Identify and assess the data to be used.
Assemble the team to address the problem.
Prepare the data, employing both humans and AI when possible.
Develop your application and validate that it works.
Support the application and try to inculcate quality in content creation processes.
1. Address unstructured data quality issues problem by problem, not all at once.
There is too much unstructured data to improve all at once. Project leaders should ensure that all involved agree on the problem/opportunity to be addressed. Priorities should be based first on the value to the business of solving the problem, and second on the feasibility and cost of developing a solution—including data quality improvement.
Areas of the business where the data is already of reasonably good quality should receive higher priority. That's the approach Nelson Frederick Bamundagea, IT director at the truck refrigeration servicing company W&B Services, has taken. His knowledge retrieval application for service technicians draws on the schematics of some 20 refrigerator models provided by two manufacturers. These schematics have been used over and over, and the vocabulary they employ is relatively small, providing a high level of trust. More generally, Alex Borek advises companies to 'first look to highly curated data products whenever possible.'
2. Identify and assess the data to be used.
Since data is critical to the success of an LLM-based knowledge project, it's important to assess it at an early stage. There is a human tendency to include every possibly relevant document in a RAG application, but companies should adopt a healthy skepticism and a 'less is more' philosophy: absent a good reason to trust a document or content source, don't include it.
Experts are unlikely to be able to evaluate every document, but they can dig deeply into a small sample. Are the sampled documents loaded with errors, internal inconsistencies, or confusing language, or are they relatively clean? Use your judgment: keep clean data and proceed with caution; toss bad data. If the data is in horrible shape or you can't find enough good data, reconsider the project.
3. Assemble the team to address the problem.
Given the need for some human curation of unstructured data, it's unlikely that a small team of experts can accomplish the necessary work alone. In addition, those who work with the data day to day typically have a better idea of what constitutes high quality and how to achieve it. In many cases, then, it may be helpful to make data quality improvement a broadly participative project. At Scotiabank, for example, the contact center organization needed to curate documents for a customer chatbot. Center staff took responsibility for the quality of the customer support knowledge base and ensured that each document fed into the RAG-based chatbot was clear, unique, and up to date.
4a. Prepare the data with humans.
If you've concluded, as you should, that humans must contribute to improving unstructured data quality, this is the time to engage them. That contribution could include having a stakeholder group agree on key terms ('contract,' 'proposal,' 'technical note,' and 'customer' might be examples) and how they are defined, then documenting this work in a business glossary. This can be hard. Consistent with 'Davenport's Law,' first stated more than 30 years ago, the more an organization knows or cares about a particular information element, the less likely it is to have a common term and meaning for it. The issue can be overcome through 'data arguing' (not data architecture) until the group arrives at a consensus.
And, of course, if there is a human curation role, this is the time to begin it. That entails deciding which documents or content sources are best for a particular issue, 'tagging' them with metadata, and scoring content on attributes such as recency, clarity, and relevance to the topic. Morgan Stanley, for example, has a team of 20 or so analysts based in the Philippines that scores each document along 20 different criteria.
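For illustration, here is a minimal sketch of how analyst scores might feed an inclusion decision, loosely modeled on the Morgan Stanley example. The criteria, weights, and threshold are all assumptions to adapt, not the firm's actual rubric.

```python
# A sketch of human curation scores feeding a RAG inclusion decision.
# The criteria, weights, and threshold are illustrative assumptions.
CRITERIA_WEIGHTS = {"recency": 0.4, "clarity": 0.3, "relevance": 0.3}
MIN_SCORE = 3.5  # on a 1-5 scale; the cutoff is an assumption to tune

def curation_score(scores: dict[str, float]) -> float:
    """Weighted average of analyst-assigned scores (1 = poor, 5 = excellent)."""
    return sum(CRITERIA_WEIGHTS[c] * scores[c] for c in CRITERIA_WEIGHTS)

def include_in_rag(doc_scores: dict[str, float]) -> bool:
    return curation_score(doc_scores) >= MIN_SCORE

print(include_in_rag({"recency": 5, "clarity": 4, "relevance": 3}))  # True (4.1)
```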
4b. Prepare the data with AI.
Gen AI itself is quite good at some of the tasks needed to prepare unstructured data for other gen AI applications. It can, for example, summarize content, classify documents by category of content, and tag key data elements. CarMax uses generative AI to translate different car manufacturers' specific language for describing automotive components and capabilities into a standard set of descriptions that lets a consumer compare cars across manufacturers.
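As a sketch of the kind of classification and tagging gen AI can perform during preparation, consider the following. Here generate() is a hypothetical stand-in for your LLM client, and the categories are illustrative rather than a recommended taxonomy.

```python
# A sketch of LLM-assisted classification and tagging during data preparation.
# generate() is a hypothetical stand-in for an LLM client; the categories
# are illustrative, not a recommended taxonomy.
import json

CATEGORIES = ["contract", "proposal", "technical note", "customer correspondence"]

def classify_and_tag(text: str, generate) -> dict:
    prompt = (
        "Classify the document into exactly one of these categories: "
        f"{', '.join(CATEGORIES)}. Then list up to 5 key entities "
        "(people, products, dates) it mentions. Reply as JSON with keys "
        "'category' and 'entities'.\n\nDocument:\n" + text[:4000]
    )
    raw = generate(prompt)
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        # LLM output is not guaranteed to be valid JSON; route to human review.
        return {"category": "needs_review", "entities": []}
```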
Gen AI can also create good first drafts of 'knowledge graphs,' which map how pieces of information relate to one another in a network; knowledge graphs improve RAG's ability to find the best content quickly. Gen AI is likewise good at de-duplication: finding exact or near copies of documents and eliminating all but one. And since RAG approaches pick documents based on specified criteria, those criteria (recency, authorship, etc.) can be re-weighted ('re-ranked') to give certain documents more influence in content search.
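Here, for example, is a minimal sketch of near-duplicate detection using embeddings. embed() is a hypothetical stand-in for an embedding model, and the 0.95 similarity threshold is an assumption to tune against your own corpus.

```python
# A sketch of near-duplicate removal with embeddings: keep the first document
# in each cluster of highly similar ones. embed() is a hypothetical stand-in
# for an embedding model; the 0.95 threshold is an assumption to tune.
def _cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = (sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5)
    return dot / norm if norm else 0.0

def deduplicate(docs: list[str], embed, threshold: float = 0.95) -> list[str]:
    """Drop documents whose embedding is nearly identical to one already kept."""
    kept: list[str] = []
    kept_vecs: list[list[float]] = []
    for doc in docs:
        vec = embed(doc)
        if all(_cosine(vec, kv) < threshold for kv in kept_vecs):
            kept.append(doc)
            kept_vecs.append(vec)
    return kept
```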
We have found, however, that AI is not particularly good at identifying the best document in a set of similar ones, even when given a grading rubric. For that task, and for reviewing AI's work, humans are still necessary. As a starting point, we recommend using humans to figure out what needs to be done, and machines to increase scale and decrease unit cost in execution.
5. Develop your application and validate that it works.
The process of developing a RAG model from curated data involves several rather technical steps, best performed by qualified technical staff. And even after doing everything possible to prepare the data, organizations must rigorously test their RAG applications before putting them into production. This is particularly important for applications that are highly regulated or involve human well-being. One way to validate the model is to identify '50 golden questions': a team specifies questions that the RAG application must get right, determines whether it does so, and acts accordingly. Validation should be repeated over time, given that foundational LLMs change often.
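A minimal harness for the golden-questions idea might look like the sketch below. The sample question is invented for illustration, and the naive substring check is a deliberate simplification; real validation often uses rubric grading or human review.

```python
# A sketch of a golden-questions harness: run each must-get-right question
# through the application and report failures. answer_fn is the RAG entry
# point; the substring pass check is a deliberate simplification.
GOLDEN = [
    # Illustrative entry; a real suite would hold the team's ~50 questions.
    {"question": "What is the claims filing deadline?", "must_contain": "30 days"},
]

def validate(answer_fn) -> list[dict]:
    failures = []
    for case in GOLDEN:
        response = answer_fn(case["question"])
        if case["must_contain"].lower() not in response.lower():
            failures.append({"question": case["question"], "got": response})
    return failures  # rerun on a schedule, since foundational LLMs change often
```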
When a European insurer validated its system for knowledge about how to handle claims, it found that customers' contracts, call center personnel, the company's knowledge base, and the claims department often disagreed. This led the company to clarify that the Claims Department 'owned' the answer, i.e., served as the gold standard. Changes to the chatbot, customer contracts, and call center training followed.
6. Support the application and try to inculcate ongoing quality.
As a practical matter, no RAG application will enjoy universal acclaim the minute it is deployed. The application can still hallucinate; there will be bugs to work out and some level of customer dissatisfaction. We find that some users discount a well-performing RAG application if it makes any errors whatsoever. And changes will be needed as the application is used in new ways. So plan for ongoing quality management and improvement. The plan should include:
Some amount of 'qualified human in the loop' review, especially in more critical situations
A means to trap errors, conduct root cause analysis, and prevent them going forward (see the sketch after this list)
Efforts to understand who the customers of the RAG are, how they use it, and how they define 'good'
Feedback to managers responsible for the business processes that create unstructured data, to improve future inputs. Content creators can be trained, for example, to create higher-quality documents, tag them as they go, and add them to a central repository.
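As referenced in the list above, a simple error-trapping record can support root cause analysis. The sketch below is one possible shape; the fields and cause categories are assumptions, not a standard.

```python
# A sketch of an error-trapping record to support root cause analysis.
# Field names and cause categories are illustrative assumptions.
from dataclasses import dataclass, field
from datetime import datetime, timezone

CAUSES = ["stale_source_doc", "retrieval_miss", "model_hallucination", "bad_question"]

@dataclass
class ErrorReport:
    question: str
    bad_answer: str
    reported_by: str
    root_cause: str = "unclassified"  # set during analysis; one of CAUSES
    source_docs: list[str] = field(default_factory=list)
    reported_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

# Aggregating reports by root_cause shows whether fixes belong in the content,
# the retrieval step, or the model configuration.
```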
RAG, which combines proprietary content with LLMs, appears to be with us for the foreseeable future. It is one of the best ways to gain value from gen AI, provided you can feed models high-quality unstructured data. We know there is a lot here, but it is certainly within reach of those who buckle down and do the work.