
Why Unstructured Data Is Sorting Itself Out
Lego pieces for sale at a Lego Store in Annapolis, Maryland, on April 7, 2025. Earlier in March 2025, Lego's CEO told AFP that US President Donald Trump's tariff threats do not keep him up at night, as the world's largest toymaker on Tuesday posted record earnings for 2024. Sales rose 13 percent to 74.3 billion kroner ($10.8 billion) last year, while net profit grew five percent to 13.8 billion kroner. (Photo by Jim Watson/AFP via Getty Images)
Information, without order, is chaotic. Attempting to work with data without structure and form is rather like watching white noise fuzz on an un-cabled television set, where shapes are almost familiar, but devoid of any recognizable manifestation. Unstructured data inside organizations appears to be full of energy, but it is weighed down by an inertia which precludes it from being useful, primarily because it doesn't know which home (application) it belongs to.
To define the term, let's first say that structured data includes spreadsheets with their formalized rows and columns, 'form-based' data resources where we know the fields in a document and so we know what values to expect… and of course relational databases, the purest form of an ordered and structured data repository. Unstructured data, therefore, includes non-tabular data spanning records of phone calls and voicemails; raw video that has yet to get meta-tagged to explain its contents; blogs and web pages; emails; and social media posts in all their forms.
Some data that may appear structured (such as sensor data from surveillance and internet of things devices) is still essentially unstructured. For example, 6,000 temperature readings and gyroscope movement records aren't necessarily structured just because they are numbered by sequence; they need to be extracted, parsed, deduplicated and manipulated to become structured for productive use. In so many cases, unstructured data is regarded as an untapped source of real business context, but it is often the hardest to bring in line, the hardest to govern and the toughest to operationalize.
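To make that extract, parse and deduplicate step concrete, here is a minimal Python sketch. The raw line format, the sensor names and the `structure_readings` helper are all assumptions invented for illustration; real device output will differ, and a production pipeline would need far more validation.

```python
import re
from typing import NamedTuple


class Reading(NamedTuple):
    seq: int
    sensor: str
    celsius: float


# Hypothetical raw format (an assumption): "<sequence> <sensor-id> temp=<value>C"
LINE_PATTERN = re.compile(r"^(\d+)\s+(\S+)\s+temp=(-?\d+(?:\.\d+)?)C$")


def structure_readings(raw_lines):
    """Extract, parse and deduplicate raw sensor lines into typed records."""
    seen = set()
    records = []
    for line in raw_lines:
        match = LINE_PATTERN.match(line.strip())
        if match is None:
            continue  # skip lines that do not fit the assumed format
        reading = Reading(int(match.group(1)), match.group(2), float(match.group(3)))
        key = (reading.sensor, reading.seq)
        if key in seen:
            continue  # drop repeated transmissions of the same reading
        seen.add(key)
        records.append(reading)
    # Order by sensor, then by sequence number, so the output is predictable
    return sorted(records, key=lambda r: (r.sensor, r.seq))


# Made-up sample input: out of order, one duplicate, one line of noise
raw = [
    "2 probe-a temp=21.5C",
    "1 probe-a temp=21.4C",
    "2 probe-a temp=21.5C",
    "sensor rebooted",
    "1 probe-b temp=19.0C",
]
structured = structure_readings(raw)
for reading in structured:
    print(reading)
```

The sequence numbers alone gave the raw lines no real structure; only after parsing into typed fields and removing the duplicate do the readings become something a database or analytics tool could use.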
Technology analyst house IDC refers to the unclassified morass of information as the 'unseen data conundrum' and estimates that unsiloed reserves of unstructured data now make up 'the majority of enterprise information'. IDC also suggests that it is growing by 55% each year. These data blind spots are thought to create operational risk and to potentially undermine the value of AI. This is important now because organizations are using unstructured data to power large language models and retrieval-augmented generation applications.
There's a whole marketplace of vendors offering unstructured data toolsets today. Amazon Web Services (AWS) offers an entire menu of functions in this space. Amazon Comprehend is a natural language processing and machine learning service capable of extracting metadata, extracting key phrases and determining sentiment from text in multiple languages. AWS positions this service alongside the Amazon Transcribe speech-to-text tools, the quirkily named Amazon Rekognition image and video analysis service… and there's also Amazon Textract, which extracts text and data from scanned documents and images.
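As a rough illustration of how a service like Amazon Comprehend turns free text into structured fields, the Python sketch below wraps its `detect_sentiment` and `detect_key_phrases` operations. The `enrich_text` helper and the `StubComprehend` class are hypothetical names invented here; the stub returns canned values so the example runs offline, and a real call would need AWS credentials and would return the service's own results.

```python
def enrich_text(client, text, language_code="en"):
    """Turn one piece of free text into a small structured record."""
    sentiment = client.detect_sentiment(Text=text, LanguageCode=language_code)
    phrases = client.detect_key_phrases(Text=text, LanguageCode=language_code)
    return {
        "text": text,
        "sentiment": sentiment["Sentiment"],
        "key_phrases": [phrase["Text"] for phrase in phrases["KeyPhrases"]],
    }


class StubComprehend:
    """Offline stand-in (an assumption) that mimics the response shapes
    with canned values; it performs no real language analysis."""

    def detect_sentiment(self, Text, LanguageCode):
        return {"Sentiment": "NEGATIVE"}

    def detect_key_phrases(self, Text, LanguageCode):
        return {"KeyPhrases": [{"Text": "the delivery"}, {"Text": "two weeks"}]}


# With AWS credentials configured, the real client would be created like this:
#   import boto3
#   client = boto3.client("comprehend")
client = StubComprehend()
record = enrich_text(client, "The delivery arrived two weeks late.")
print(record)
```

Passing the client in as a parameter is a design choice made for this sketch: it keeps the structuring logic testable without network access.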
Given the breadth of AWS services in this market, it would be reasonable to expect similar-but-skewed proprietary versions of these functions from the other major cloud hyperscalers. Microsoft Azure Cosmos DB is a globally distributed, multi-model database with enough intelligence to be able to manage structured, semi-structured and unstructured data. This cloud-native database might be used alongside the playfully named Azure Blob Storage service, an object storage service designed for storing large amounts of unstructured data that might exist in images, videos, documents and other binary data. Also from Microsoft, AI Document Intelligence uses machine learning to extract text, key-value pairs, tables and structures from documents automatically.
Not to be left out, Google Cloud Platform also works at this level. The cloud and search giant points to its BigQuery brand and the object tables function within it. 'Object tables provides a structured record interface for unstructured data stored in Google Cloud Storage. This enables [users]
Given the services that exist as fairly prominent functions in the major cloud providers and the toolsets available from more specialized players, working with unstructured data is clearly now a more pressing need. Often referred to as enterprise content management (ECM), the discipline is certainly growing in the combined shadow of big data analytics and the rise of artificial intelligence.
The natural evolution for a data market like this is the arrival of industry-specific services aligned to industry verticals. Known for its work in unstructured data management across the healthcare industry, Hyland treads a careful line with its messaging as the company clearly wants to be seen as applicable to all use cases. The company says Hyland Content Intelligence turns unstructured data into actionable, AI-ready content, with the 2025 arrival of its Knowledge Enrichment service (currently in Beta) being among its star players.
Related technologies are also present at IBM in the form of Watson Discovery for unstructured search and AI; at Elastic for indexing and querying of unstructured text and logs; and at Cloudera for Hadoop-based data lake services across unstructured and semi-structured data. Add Databricks, Collibra, Alation, Palantir and Varonis, to name but a mouthful, and it is clear that there is a lot of structure being applied to the unstructured data space.
'Unstructured data remains a black box for most organizations, [especially] as it becomes critical for AI and business operations,' said Jay Limburn, chief product officer at Ataccama. 'Without a way to structure, govern and trust that information, enterprises risk missing the full value of their data.'
Limburn points to his firm's Ataccama One platform as a means to combine data quality, governance, observability, lineage and master data management. Ataccama One is now available on Snowflake Marketplace as a new integration with Document AI, a Snowflake AI feature that uses Arctic-TILT, a proprietary large language model used to extract data from documents.
This fusion of data structuring services is billed as a means of turning unstructured content, such as contracts, invoices and PDFs, into structured data by running models directly within Snowflake. Businesspeople can use natural language prompts, such as 'What is the effective date of the contract?', which are then processed by Snowflake to create structured outputs written directly into Snowflake tables.
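The general pattern of asking a question of each document and writing the answer into a table can be sketched in a few lines of Python. This is not Snowflake's or Ataccama's implementation: a simple regular expression stands in for the document model, an in-memory SQLite database stands in for Snowflake tables, and the file names and document text are invented for illustration.

```python
import re
import sqlite3


def answer_effective_date(document_text):
    """Stand-in for a document model answering the question
    'What is the effective date of the contract?'.
    This is a plain regex, not an LLM, and only knows two phrasings."""
    match = re.search(
        r"effective (?:as of|date:?)\s+(\d{4}-\d{2}-\d{2})",
        document_text,
        re.IGNORECASE,
    )
    return match.group(1) if match else None


# Made-up unstructured inputs (assumptions for this sketch)
documents = {
    "contract_001.txt": "This agreement is effective as of 2025-03-01 between the parties.",
    "contract_002.txt": "Effective date: 2024-11-15. The supplier shall deliver monthly.",
    "memo_003.txt": "Reminder: the office is closed on Friday.",
}

# SQLite stands in for the destination table
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE contract_fields (file_name TEXT PRIMARY KEY, effective_date TEXT)"
)
for file_name, text in documents.items():
    conn.execute(
        "INSERT INTO contract_fields VALUES (?, ?)",
        (file_name, answer_effective_date(text)),
    )
conn.commit()

rows = conn.execute(
    "SELECT file_name, effective_date FROM contract_fields ORDER BY file_name"
).fetchall()
for row in rows:
    print(row)
```

Documents for which no answer is found are stored with an empty value rather than skipped, so the gaps stay visible to whoever governs the resulting table.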
Where does the unstructured marketplace go next? If we accept the proposition that AI services are partly responsible for the surge in this sector (or let's at least call it a sub-surge in a sub-sector), then we might actually see AI services themselves starting to shoulder the responsibility for structuring our unstructuredness.
Given the current debate over whether chat-based AI services will take over browser search - and the fact that OpenAI offers GPT-based APIs for text extraction, summarization, semantic intent analysis and classification - that might be exactly what happens.
