
To Create Value with AI, Improve the Quality of Your Unstructured Data
A company's content lies largely in 'unstructured data': the emails, contracts, forms, SharePoint files, meeting recordings, and so forth created through everyday work processes. That proprietary content makes gen AI more distinctive, more knowledgeable about your products and services, less likely to hallucinate, and more likely to deliver economic value. As a chief data officer we interviewed pointed out, 'You're unlikely to get much return on your investment by simply installing Copilot.'
Many companies have concluded that the greatest value from gen AI lies in combining the astounding language, reasoning, and general-knowledge capabilities of large language models (LLMs) with their own proprietary content. That combination is necessary, for example, in enterprise-level gen AI applications in customer service, marketing, legal, and software development, as well as in product and service offerings for customers.
The most common approach by far to adding a company's own content is 'retrieval augmented generation,' or RAG, which combines traditional information retrieval tools, such as search over a database of documents, with the language generation of LLMs. RAG is used because submitting vast quantities of content in a prompt is often technically infeasible or expensive. While technically complex, the approach is quite feasible and yields accurate responses to user prompts, provided the unstructured data it draws on is of high quality. Therein lies the problem. Unstructured data is frequently of poor quality: obsolete, duplicative, inaccurate, and poorly structured, among other problems.
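To make the mechanics concrete, here is a minimal sketch of the RAG pattern: retrieve the documents most relevant to a question, then ground the LLM's answer in them. The embed() and generate() helpers are hypothetical stand-ins for whatever embedding model and LLM endpoint a company uses; this illustrates the pattern, not any particular vendor's implementation.

```python
# A minimal sketch of the RAG pattern. embed() and generate() are hypothetical
# stand-ins for an embedding model and an LLM endpoint, passed in by the caller.
from dataclasses import dataclass

@dataclass
class Doc:
    doc_id: str
    text: str
    embedding: list[float]  # precomputed by embed() at indexing time

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = (sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5)
    return dot / norm if norm else 0.0

def answer(question: str, index: list[Doc], embed, generate, k: int = 3) -> str:
    """Retrieve the k most relevant documents, then ground the LLM in them."""
    q_vec = embed(question)
    top = sorted(index, key=lambda d: cosine(q_vec, d.embedding), reverse=True)[:k]
    context = "\n\n".join(f"[{d.doc_id}] {d.text}" for d in top)
    prompt = (
        "Answer using ONLY the context below. If the answer is not in the "
        f"context, say so.\n\nContext:\n{context}\n\nQuestion: {question}"
    )
    return generate(prompt)
```

Note that the quality of the answer is bounded by the quality of the documents in the index, which is the central argument of this article.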
Most companies have not done well with the quality of structured data, even though that data is used every day to complete business transactions and understand performance. Unstructured data is tougher still. The last serious attempts to address it date to the 1990s and 2000s, when knowledge management was popular, and most of those efforts proved unsuccessful. Surveys confirm that most leaders are aware that poor quality hampers their generative AI efforts, and that they had no strong focus on unstructured data until the advent of gen AI.
Of course, the best way to deal with data quality problems is to prevent them. Over the long term, companies serious about AI must develop programs to do just that. Those who create documents, for example, need to learn to evaluate them for quality and tag key elements. But this will take much concerted effort and is no help in the short term. To get value from gen AI now, companies need to build RAG applications using high-quality unstructured data. Our objective in this article is to help them do so by summarizing the most important data problems and the best approaches, both human and technical, for dealing with them.
What Is Data Quality for Unstructured Data?
High-quality data, whether structured or unstructured, results only from focused effort: active, engaged leadership, some well-placed professionals, clear management responsibilities for all who touch data, and a relentless commitment to continuous improvement. Absent these things, chances are high your data is not up to snuff. As coach and advisor Alex Borek of the Data Masterclass told us, 'When AI doesn't work, it often reveals flaws in the human system.'
Indeed, the best estimate is that 80% of the time spent on an AI project will be devoted to data. For example, a Philippines-based Morgan Stanley team spent several years curating research reports in advance of the firm's AI @ Morgan Stanley Assistant project. The curation started before gen AI became widespread, which allowed Morgan Stanley to get its application into production more quickly.
To work effectively, RAG requires documents directly relevant to the problem at hand, a minimum of duplicated content, and information in those documents that is complete, accurate, and up to date. Further, as Seth Earley of Earley Information Science noted, 'You must supply context, as much as possible, if an LLM is to properly interpret these documents.'
Unstructured data does not come preloaded with the needed context, and gen AI is largely incapable of determining what information is best for solving a particular business question or issue. It is also not good at 'entity resolution': Is the 'John Smith' in document A, about customers, the same person as the 'J. A. Smith' in document B, about vendors, or the 'Mr. J Smith' in document C, about a donation to our foundation?
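A toy example shows why this is hard even before an LLM gets involved: naive name normalization cannot distinguish one person from three. The names and file names below are illustrative.

```python
# A toy illustration of the entity-resolution problem. Naive normalization
# cleans up names but cannot decide whether they refer to the same person;
# production systems add addresses, fuzzy matching, and human review.
import re

def normalize(name: str) -> str:
    """Lowercase, strip honorifics and punctuation, collapse whitespace."""
    name = re.sub(r"\b(mr|mrs|ms|dr)\.?\s+", "", name.lower())
    name = re.sub(r"[^a-z ]", " ", name)
    return " ".join(name.split())

records = [
    ("customers.docx", "John Smith"),
    ("vendors.docx", "J. A. Smith"),
    ("foundation.docx", "Mr. J Smith"),
]

for source, raw in records:
    print(source, "->", normalize(raw))
# Yields 'john smith', 'j a smith', and 'j smith': normalization alone cannot
# tell whether these are one person or three, which is why entity resolution
# needs richer context than the documents themselves supply.
```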
Most structured data is defined in a data model or dictionary. That provides some context and helps reduce the John Smith/J. A. Smith problem described above; it also makes it easier to find the data you want, learn who is responsible for it, and understand what it means. As John Duncan, the head of data governance for the large car retailer CarMax, told us, unstructured data requires the same clarity about data ownership, producers, consumers, and stewards. It also benefits from standards for data quality thresholds, data lineage, access controls, and retention periods. This metadata is typically recorded in a data dictionary.
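As a concrete illustration, here is what a dictionary entry for one unstructured content type might capture, in a Python sketch. The field names and values are our assumptions for illustration, not a standard schema or CarMax's actual dictionary.

```python
# A sketch of a data-dictionary entry for an unstructured content type,
# covering the ownership, quality, lineage, access, and retention metadata
# discussed above. All fields and values are illustrative assumptions.
supplier_contract_entry = {
    "content_type": "supplier_contract",
    "definition": "Executed agreement between the company and a supplier",
    "owner": "Procurement",
    "producers": ["Legal", "Procurement"],
    "consumers": ["Finance", "Risk", "Gen AI supplier-risk assistant"],
    "steward": "contracts-data-steward@example.com",  # hypothetical address
    "quality_thresholds": {"completeness": 0.95, "max_age_days": 365},
    "lineage": "e-signature system -> contracts library -> RAG index",
    "access_controls": ["procurement-staff", "legal-staff"],
    "retention": "7 years after contract end",
}
```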
With unstructured data, however, such a dictionary seldom exists. Often there is no centralized management of content at all; documents are stored haphazardly, under different naming conventions, on different computers or cloud services across the company. There is often no common definition of a content type: an ad agency data leader confessed that there was no common definition of a 'pitch' across the agency. Finally, unstructured documents were usually created for a purpose other than feeding gen AI. A contract with a supplier, for example, was not designed to provide insight into the level of risk in the supplier relationship. We believe it was the late management thinker Charles Handy who observed, 'Information gathered for one purpose is seldom useful for another.'
An Unstructured Data Quality Process
Fortunately, there are several approaches and tools that can help to improve unstructured data. We recommend that all AI projects follow a disciplined process, building quality in wherever they can. Such a process must embrace the following steps:
Address unstructured data quality issues problem by problem, not all at once.
Identify and assess the data to be used.
Assemble the team to address the problem.
Prepare the data, employing both humans and AI when possible.
Develop your application and validate that it works.
Support the application and try to inculcate quality in content creation processes.
1. Address unstructured data quality issues problem by problem, not all at once.
There is too much unstructured data to improve all at once. Project leaders should ensure that all involved agree on the problem/opportunity to be addressed. Priorities should be based first on the value to the business of solving the problem, and second on the feasibility and cost of developing a solution—including data quality improvement.
Areas of the business where the data is already of reasonably good quality should receive higher priority. That's the approach Nelson Frederick Bamundagea, IT director at the truck refrigeration servicing company W&B Services, has taken. His knowledge retrieval application for service technicians draws on the schematics of some 20 refrigerator models provided by two manufacturers. These schematics have been used over and over, and the vocabulary they employ is relatively small, providing a high level of trust. More generally, Alex Borek advises companies to 'first look to highly curated data products whenever possible.'
2. Identify and assess the data to be used.
Since data is critical to the success of an LLM-based knowledge project, it's important to assess it at an early stage. There is a human tendency to include every possibly relevant document in a RAG application, but companies should adopt a healthy skepticism and a 'less is more' philosophy: absent a good reason to trust a document or content source, don't include it.
Experts are unlikely to be able to evaluate every document, but they can dig deeply into a small sample. Are the sampled documents loaded with errors, internal inconsistencies, or confusing language, or are they relatively clean? Use your judgment: keep clean data and proceed with caution; toss bad data. If the data is in horrible shape or you can't find enough good data, reconsider the project.
3. Assemble the team to address the problem.
Given the need for some human curation of unstructured data, it's unlikely that a small team of experts can accomplish the necessary work alone. In addition, those who work with the data day to day typically have a better idea of what constitutes high quality and how to achieve it. In many cases, then, it may be helpful to make data quality improvement a broadly participative project. At Scotiabank, for example, the contact center organization needed to curate documents for a customer chatbot. Center staff took responsibility for the quality of the customer support knowledge base and ensured that each document fed into the RAG-based chatbot was clear, unique, and up to date.
4a. Prepare the data with humans.
If you've concluded, as you should, that humans must contribute to improving unstructured data quality, this is the time to engage them. That contribution could include having a stakeholder group agree on key terms ('contract,' 'proposal,' 'technical note,' and 'customer' might be examples) and how they are defined, then documenting this work in a business glossary. This can be hard. Consistent with 'Davenport's Law,' first stated more than 30 years ago, the more an organization knows or cares about a particular information element, the less likely it is to have a common term and meaning for it. The issue can be overcome through 'data arguing' (not data architecture) until the group arrives at a consensus.
And, of course, if there is a human curation role, this is the time to begin it. That entails deciding which documents or content sources are best for a particular issue, 'tagging' them with metadata, and scoring content on attributes such as recency, clarity, and relevance to the topic. Morgan Stanley, for example, has a team of 20 or so analysts based in the Philippines that scores each document along 20 different criteria.
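For illustration, here is a minimal sketch of how analyst scores might feed an inclusion decision, loosely modeled on the Morgan Stanley example. The criteria, weights, and threshold are all assumptions to adapt, not the firm's actual rubric.

```python
# A sketch of human curation scores feeding a RAG inclusion decision.
# The criteria, weights, and threshold are illustrative assumptions.
CRITERIA_WEIGHTS = {"recency": 0.4, "clarity": 0.3, "relevance": 0.3}
MIN_SCORE = 3.5  # on a 1-5 scale; the cutoff is an assumption to tune

def curation_score(scores: dict[str, float]) -> float:
    """Weighted average of analyst-assigned scores (1 = poor, 5 = excellent)."""
    return sum(CRITERIA_WEIGHTS[c] * scores[c] for c in CRITERIA_WEIGHTS)

def include_in_rag(doc_scores: dict[str, float]) -> bool:
    return curation_score(doc_scores) >= MIN_SCORE

print(include_in_rag({"recency": 5, "clarity": 4, "relevance": 3}))  # True (4.1)
```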
4b. Prepare the data with AI.
Gen AI itself is quite good at some of the tasks needed to prepare unstructured data for other gen AI applications. It can, for example, summarize content, classify documents by category of content, and tag key data elements. CarMax uses generative AI to translate different car manufacturers' specific language for describing automotive components and capabilities into a standard set of descriptions that lets a consumer compare cars across manufacturers.
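As a sketch of the kind of classification and tagging gen AI can perform during preparation, consider the following. Here generate() is a hypothetical stand-in for your LLM client, and the categories are illustrative rather than a recommended taxonomy.

```python
# A sketch of LLM-assisted classification and tagging during data preparation.
# generate() is a hypothetical stand-in for an LLM client; the categories
# are illustrative, not a recommended taxonomy.
import json

CATEGORIES = ["contract", "proposal", "technical note", "customer correspondence"]

def classify_and_tag(text: str, generate) -> dict:
    prompt = (
        "Classify the document into exactly one of these categories: "
        f"{', '.join(CATEGORIES)}. Then list up to 5 key entities "
        "(people, products, dates) it mentions. Reply as JSON with keys "
        "'category' and 'entities'.\n\nDocument:\n" + text[:4000]
    )
    raw = generate(prompt)
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        # LLM output is not guaranteed to be valid JSON; route to human review.
        return {"category": "needs_review", "entities": []}
```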
Gen AI can also create good first drafts of 'knowledge graphs,' which map how pieces of information relate to one another in a network; knowledge graphs improve RAG's ability to find the best content quickly. Gen AI is likewise good at de-duplication: finding exact or near copies of documents and eliminating all but one. And since RAG approaches pick documents based on specified criteria, those criteria (recency, authorship, etc.) can be re-weighted ('re-ranked') to give certain documents more influence in content search.
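Here, for example, is a minimal sketch of near-duplicate detection using embeddings. embed() is a hypothetical stand-in for an embedding model, and the 0.95 similarity threshold is an assumption to tune against your own corpus.

```python
# A sketch of near-duplicate removal with embeddings: keep the first document
# in each cluster of highly similar ones. embed() is a hypothetical stand-in
# for an embedding model; the 0.95 threshold is an assumption to tune.
def _cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = (sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5)
    return dot / norm if norm else 0.0

def deduplicate(docs: list[str], embed, threshold: float = 0.95) -> list[str]:
    """Drop documents whose embedding is nearly identical to one already kept."""
    kept: list[str] = []
    kept_vecs: list[list[float]] = []
    for doc in docs:
        vec = embed(doc)
        if all(_cosine(vec, kv) < threshold for kv in kept_vecs):
            kept.append(doc)
            kept_vecs.append(vec)
    return kept
```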
We have found, however, that AI is not particularly good at identifying the best document in a set of similar ones, even when given a grading rubric. For that task, and for reviewing AI's work, humans are still necessary. As a starting point, we recommend using humans to figure out what needs to be done, and machines to increase scale and decrease unit cost in execution.
5. Develop your application and validate that it works.
The process of developing a RAG model from curated data involves several rather technical steps, best performed by qualified technical staff. And even after doing everything possible to prepare the data, organizations must rigorously test their RAG applications before putting them into production. This is particularly important for applications that are highly regulated or involve human well-being. One way to validate the model is to identify '50 golden questions': a team specifies questions that the RAG application must get right, determines whether it does so, and acts accordingly. Validation should be repeated over time, given that foundational LLMs change often.
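A minimal harness for the golden-questions idea might look like the sketch below. The sample question is invented for illustration, and the naive substring check is a deliberate simplification; real validation often uses rubric grading or human review.

```python
# A sketch of a golden-questions harness: run each must-get-right question
# through the application and report failures. answer_fn is the RAG entry
# point; the substring pass check is a deliberate simplification.
GOLDEN = [
    # Illustrative entry; a real suite would hold the team's ~50 questions.
    {"question": "What is the claims filing deadline?", "must_contain": "30 days"},
]

def validate(answer_fn) -> list[dict]:
    failures = []
    for case in GOLDEN:
        response = answer_fn(case["question"])
        if case["must_contain"].lower() not in response.lower():
            failures.append({"question": case["question"], "got": response})
    return failures  # rerun on a schedule, since foundational LLMs change often
```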
When a European insurer validated its system for knowledge about how to handle claims, it found that customers' contracts, call center personnel, the company's knowledge base, and the claims department often disagreed. This led the company to clarify that the Claims Department 'owned' the answer, i.e., served as the gold standard. Changes to the chatbot, customer contracts, and call center training followed.
6. Support the application and try to inculcate ongoing quality.
As a practical matter, no RAG application will enjoy universal acclaim the minute it is deployed. The application can still hallucinate; there will be bugs to work out and some level of customer dissatisfaction. We find that some users discount a well-performing RAG application if it makes any errors whatsoever. And changes will be needed as the application is used in new ways. So plan for ongoing quality management and improvement. The plan should include:
Some amount of 'qualified human in the loop' review, especially in more critical situations
A means to trap errors, conduct root cause analysis, and prevent them going forward (see the sketch after this list)
Efforts to understand who the customers of the RAG are, how they use it, and how they define 'good'
Feedback to managers responsible for the business processes that create unstructured data, to improve future inputs. Content creators can be trained, for example, to create higher-quality documents, tag them as they go, and add them to a central repository.
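As referenced in the list above, a simple error-trapping record can support root cause analysis. The sketch below is one possible shape; the fields and cause categories are assumptions, not a standard.

```python
# A sketch of an error-trapping record to support root cause analysis.
# Field names and cause categories are illustrative assumptions.
from dataclasses import dataclass, field
from datetime import datetime, timezone

CAUSES = ["stale_source_doc", "retrieval_miss", "model_hallucination", "bad_question"]

@dataclass
class ErrorReport:
    question: str
    bad_answer: str
    reported_by: str
    root_cause: str = "unclassified"  # set during analysis; one of CAUSES
    source_docs: list[str] = field(default_factory=list)
    reported_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

# Aggregating reports by root_cause shows whether fixes belong in the content,
# the retrieval step, or the model configuration.
```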
RAG, which combines proprietary content with LLMs, appears to be with us for the foreseeable future. It is one of the best ways to gain value from gen AI, provided you can feed models high-quality unstructured data. We know there is a lot here, but it is certainly within reach of those who buckle down and do the work.