4 days ago
- Business
- Harvard Business Review
To Create Value with AI, Improve the Quality of Your Unstructured Data
A company's content lies largely in 'unstructured data'—those emails, contracts, forms, Sharepoint files, recordings of meetings and so forth created via work processes. That proprietary content makes gen AI more distinctive, more knowledgeable about your products and services, less likely to hallucinate, and more likely to bring economic value. As a chief data officer we interviewed pointed out, 'You're unlikely to get much return on your investment by simply installing CoPilot.'
Many companies have concluded that the most value from gen AI lies in combining the astounding language, reasoning, and general knowledge of large language models (LLMs) with their own proprietary content. That combination is necessary, for example, in enterprise-level gen AI applications in customer service, marketing, legal, and software development, and product/service offerings for customer use.
The most common approach by far to adding a company's own content is 'retrieval augmented generation' or RAG, which combines traditional information-gathering tools like databases with information retrieved by LLMs. It is used because submitting vast quantities of content in a prompt is often technically infeasible or expensive. While technically complex, the RAG approach is quite feasible and yields accurate responses to user prompts if the unstructured data used in RAG is of high quality. Therein lies the problem. Unstructured data is frequently of poor quality—obsolete, duplicative, inaccurate, and poorly-structured, among other problems.
Most companies have not done well with the quality of structured data, even as this data is used every day to complete business transactions and understand performance. Unstructured data is tougher. The last serious attempts to address unstructured data date to the 1990s and 2000s when knowledge management was popular. Most efforts proved unsuccessful. Surveys confirm that most leaders are aware that poor quality hampers their generative AI efforts, and that they did not have a strong focus on unstructured data until the advent of gen AI.
Of course, the best way to deal with data quality problems is to prevent them. Over the long-term, companies serious about AI must develop programs to do just that. Those who create documents, for example, need to learn to evaluate them for quality and tag key elements. But this will take much concerted effort and is no help in the short term. To get value from gen AI, companies need to build RAG applications using high-quality unstructured data. Our objective in this article is to help them do so by summarizing the most important data problems and the best approaches for dealing with them, both human and technical.
What Is Data Quality for Unstructured Data?
High-quality data, whether structured or unstructured, only results from focused effort, led by active, engaged leadership, some well-placed professionals, clear management responsibilities for all who touch data, and a relentless commitment to continuous improvement. Absent these things, chances are high your data is not up-to-snuff. As coach and advisor, Alex Borek of the Data Masterclass told us, 'When AI doesn't work, it often reveals flaws in the human system.'
Indeed, the best estimate is that 80% of time spent on an AI project will be devoted to data. For example, a Philippines-based Morgan Stanley team spent several years curating research reports in advance of their AI @ Morgan Stanley assistant project. The curation started before gen AI became widespread, which allowed Morgan Stanley to more quickly get their application into production.
To work effectively, RAG requires documents directly relevant to the problem at hand, a minimum of duplicated content, and the information contained in those documents to be complete, accurate, and up-to-date. Further, as Seth Earley of Earley Information Science noted, 'You must supply context, as much as possible, if a LLM is to properly interpret these documents.'
Unstructured data does not come pre-loaded with the needed context, and gen AI is largely incapable of determining what is the best information to solve a particular business question or issue. It is also not good at 'entity resolution,' i.e., 'Is this 'John Smith' in document A, about customers, the same person as 'J. A. Smith' in that document B, about vendors, and/or the same person as 'Mr. J Smith' in the other document C, about a donation to our foundation?'
Most structured data is defined in a data model or dictionary. This provides some context and helps reduce the John Smith/J. A. Smith problem described above. For structured data it is easier to find the data desired, learn who is responsible for it, and understand what the data means. As John Duncan, the head of data governance for the large car retailer CarMax told us, unstructured data also requires the same need for clarity on data ownership, producers, consumers, and stewards. It also benefits from standards for data quality thresholds, data lineage, access controls, and retention durations. This metadata is typically included in a data dictionary.
However, with unstructured data, there is seldom a dictionary. Often there is no centralized management of such content; documents are stored haphazardly using different naming conventions and on different computers or cloud providers across the company. There is often no common definition of a content type; an ad agency data leader confessed that there is no common definition of a 'pitch' across the agency. Finally, unstructured documents were often developed with a different purpose than feeding gen AI. A contract with a supplier, for example, was not designed to provide insight about the level of risk in a supplier relationship. We believe it was the late management thinker Charles Handy who observed, 'Information gathered for one purpose is seldom useful for another.'
An Unstructured Data Quality Process
Fortunately, there are several approaches and tools that can help to improve unstructured data. We recommend that all AI projects follow a disciplined process, building quality in wherever they can. Such a process must embrace the following steps:
Address unstructured data quality issues problem by problem, not all at once.
Identify and assess the data to be used.
Assemble the team to address the problem.
Prepare the data, employing both humans (D1) and AI (D2), when possible.
Develop your application and validate that it works.
Support the application and try to inculcate quality in content creation processes.
1. Address unstructured data quality issues problem by problem, not all at once.
There is too much unstructured data to improve all at once. Project leaders should ensure that all involved agree on the problem/opportunity to be addressed. Priorities should be based first on the value to the business of solving the problem, and second on the feasibility and cost of developing a solution—including data quality improvement.
Areas of the business with data that is already of reasonably good quality should receive higher priority. That's the approach Nelson Frederick Bamundagea, IT director at the truck refrigeration servicing company W&B Services, has taken. His knowledge retrieval application for service technicians uses the schematics of some 20 (refrigerator) models provided by two manufacturers. These have been used over and over and the vocabulary employed is relatively small, providing for a high level of trust. More generally, Alex Borek advises companies to 'first look to highly curated data products whenever possible.'
2. Identify and assess the data to be used.
Since the data is critical to the success of an LLM-based knowledge project, it's important to assess the data at an early stage. There is a human tendency to include any possibly relevant document in a RAG, but companies should adopt a healthy skepticism and a 'less is more' philosophy: absent a good reason to trust a document or content source, don't include it.
It's not likely that experts can evaluate every document, but they can dig deeply into a small sample. Are the sample documents loaded with errors, internal inconsistencies, or confusing language—or are they relatively clean? Use your judgment: Keep clean data and proceed with caution; toss bad data. If the data are in horrible shape or you can't find enough good data, reconsider the project.
3. Assemble the team to address the problem.
Given the need for some human curation of unstructured data, it's unlikely that a small team of experts can accomplish the necessary work. In addition, those who work with the data from day-to-day typically have a better idea of what constitutes high quality and how to achieve it. In many cases, then, it may be helpful to make data quality improvement a broadly participative project. For example, at Scotiabank, the contact center organization needed to curate documents for a customer chatbot. Center staff took responsibility for the quality of its customer support knowledge base and ensured that each document fed into the RAG-based chatbot was clear, unique, and up to date.
4a. Prepare the data.
If you've concluded—and you should—that there must be a human contribution to improving unstructured data quality, this is the time to engage it. That contribution could include having a stakeholder group agree on the key terms—e,g., 'contract,' 'proposal,' 'technical note,' and 'customer' might be examples—and how they are defined. Document this work in a business glossary. This can be hard: Consistent with 'Davenport's Law'—first stated more than 30 years ago —the more an organization knows or cares about a particular information element, the less likely it is to have a common term and meaning for it. This issue can be overcome through 'data arguing' (not data architecture) until the group arrives at a consensus.
And, of course, if there is a human curation role, this is the time to begin it. That entails deciding which documents or content sources are the best for a particular issue, 'tagging' it with metadata, and scoring content on such attributes as recency, clarity, and relevance to the topic. Morgan Stanley has a team of 20 or so analysts based in the Philippines that scores each document along 20 different criteria.
4b. Prepare the data with AI.
Gen AI itself is quite good at some tasks needed to prepare unstructured data for other gen AI applications. It can, for example, summarize content, classify documents by category of content, and tag key data element. For example, CarMax uses generative AI to translate different car manufacturers' specific language for describing automotive components and capabilities into a standard set of descriptions that is meant to enable a consumer to compare cars across manufacturers.
Gen AI can also create good first drafts of 'knowledge graphs,' or displays of what information is related to other information in a network. Knowledge graphs improve the ability of RAG to find the best content quickly. Gen AI is also good at de-duplication, or the process of finding exact or very similar copies of documents and eliminating all but one. Since RAG approaches pick documents based on specified criteria, these criteria (recency, authorship, etc.) can be changed ('re-ranked') to give higher weight to certain ones in content search.
We have found, however, that AI is not particularly good at identifying the best document in a set of similar ones, even when given a grading rubric. For that and reviewing tasks humans are still necessary. As a starting point, we recommend using humans to figure what needs to be done, and machines to increase scale and decrease unit cost in execution.
5. Develop your application and validate that it works.
The process of developing a RAG models from curated data involves several rather technical steps, best performed by qualified tech staff. Even after having done everything possible to prepare the data, it is essential that organizations rigorously test their RAG applications before putting them into production. This is particularly important for applications that are highly regulated or involve human well-being. One way to validate the model involves identifying '50 Golden Questions,' in which a team identifies questions that the RAG must get right, determines whether it does so, and acts accordingly. The validation should be done over time given that foundational LLMs change often.
When a European insurer tried to validate its system for knowledge on how to address claims, it found that customers' contracts, call center personnel, the company's knowledge base, and the claims department often disagreed. This led the company to clarify that the Claims Department 'owned' the answer, i.e., served as the 'gold standard.' Changes to the chatbot, customer contracts, and call center training followed.
6. Support the application and try to inculcate ongoing quality.
As a practical matter, no RAG application will enjoy universal acclaim the minute it is deployed. The application can still hallucinate, there will be bugs to be worked out and some level of customer dissatisfaction. We find that some discount a well-performing RAG application if it makes any errors whatsoever. Finally, changes will be needed as the application is used in new ways. So, plan for ongoing quality management and improvement. The plan should include:
Some amount of 'qualified-human-in-the-loop' especially in more critical situations
A means to trap errors, conduct root cause analysis, and prevent them going forward
Efforts to understand who the customers of the RAG are, how they use it, and how they define 'good'
Feedback to managers responsible for business processes that create unstructured data to improve future inputs. Content creators can be trained, for example, to create higher-quality documents, tag them as they create, and add them to a central repository.
It appears that RAG, featuring proprietary content, combined with LLMs, is going to be with us for the foreseeable future. It is one of the best ways to gain value from gen AI if one can feed models high-quality unstructured data. We know there is a lot here, but it is certainly within reach of those who buckle down and do the work.