Data products and services are playing a new role in business.
Data mechanics isn't an industry. Discussion in this space tends to gravitate around 'data science' as a more pleasing umbrella term, perhaps borrowing half its name from computer science (the label usually attached to university courses designed to qualify software application developers who want to program). Either way, there is a constant push for development in the data mechanics and management space, even though we're now well over half a century on from the arrival of the first database systems.
Although data mechanics may not be an industry, it is (in various forms) a company. Data-centric cloud platform company NetApp acquired Data Mechanics back in 2021. Known as a managed platform provider for big data processing and cloud analytics, Data Mechanics gave NetApp a way to capitalize on the growing interest in Apache Spark, the open source distributed processing system for big data workloads.
But the story doesn't end there. NetApp has since sold off some of the acquisitions that work at this end of the data mechanics space to Flexera, which makes some sense given that NetApp is known for its storage competencies and styles itself as the intelligent data infrastructure company, after all. Interestingly, NetApp confirmed that a divestiture at this level will often leave a residual amount of software engineering competency (and, in some organizations, perhaps intellectual property on occasion) within the teams it still operates, so these actions have two sides to them.
NetApp is now turning its focus to expanding its work with some major technology partners to provide data engineering resources for the burgeoning AI industry. This means it is working with Nvidia on its AI Data Platform reference design via the NetApp AIPod service to accelerate (the companies both hope) enterprise adoption of agentic AI. It is also now offering NetApp AIPod Mini with Intel, a joint technology designed to streamline enterprise adoption of AI inferencing - and in both cases, the data-for-AI thought is fundamental.
If there's one very strong theme surfacing in data mechanics right now, it's simple to highlight. The industry is asking: okay, you've got data, but does your data work well for AI? As we know, AI is only as smart as what you tell it, so nobody wants garbage in, garbage out.
This theme won't be going away this year and it will be explained and clarified by organizations, foundations, evangelists and community groups spanning every sub-discipline of IT, from DevOps specialists to database and ERP vendors and everybody in between.
Operating as an independent business unit of Hitachi, Pentaho calls it 'data fitness' for the age of AI. The company is now focusing on expanding the capabilities of its Pentaho Data Catalog for this precise use. Essentially a data operations management service, this technology helps data scientists and developers know what and where their data is. It also helps monitor, classify and control data for analytics and compliance.
"The need for strong data foundations has never been higher and customers are looking for help across a whole range of issues. They want to improve the organization of data for operations and AI. They need better visibility into the 'what and where' of data's lifecycle for quality, trust and regulations. They also want to use automation to scale management with data while also increasing time to value," said Kunju Kashalikar, product management executive at Pentaho.
There's a sense of the industry wanting to provide back-end automations that shoulder the heavy infrastructure burdens associated with data wrangling on the data mechanic's workshop floor. Because organizations are now using a mix of datasets (some custom-curated, some licensed, some anonymized, some just plain old data), they will want to know which ones they can trust, at what level, for different use cases.
Pentaho's Kashalikar suggests that those factors are what the company's platform has been aligned for. He points to its machine learning enhancements for data classification (which can also cope with unstructured data), designed to improve the ability to automate and scale how data is managed across expanding data ecosystems. These tools also integrate with model governance controls, which increases visibility into how and where models are accessing data, both for appropriate use and for proactive governance.
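To make that concrete in a hand-rolled way (this is an illustrative sketch only, not Pentaho's Data Catalog API, and every name in it is hypothetical), the core idea of classification plus governance visibility can be reduced to tagging an asset based on what its content looks like and keeping an audit trail of which models touch it:

```python
# Illustrative only: a toy catalog entry that tags data assets by content
# patterns and records which models read them. Not Pentaho's API.
import re
from dataclasses import dataclass, field
from datetime import datetime, timezone

PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "phone": re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b"),
}

@dataclass
class CatalogEntry:
    name: str
    tags: set = field(default_factory=set)
    access_log: list = field(default_factory=list)

    def classify(self, sample_text: str) -> None:
        """Tag the asset based on patterns found in a sample of its content."""
        for tag, pattern in PII_PATTERNS.items():
            if pattern.search(sample_text):
                self.tags.add(f"pii:{tag}")

    def record_model_access(self, model_name: str) -> None:
        """Audit trail: which model read this asset, and when."""
        self.access_log.append((model_name, datetime.now(timezone.utc)))

entry = CatalogEntry("support_tickets")
entry.classify("Customer reached us at jane@example.com about billing")
entry.record_model_access("churn-predictor-v2")
print(entry.tags)        # {'pii:email'}
print(entry.access_log)  # [('churn-predictor-v2', datetime(...))]
```

Real catalog products layer machine learning and far richer policy controls on top of this, but the shape of the problem (classify, tag, audit) is the same.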
The data mechanics (or data science) industry tends to use industrial factory terminology throughout its nomenclature. The idea of the data pipeline is intended to convey the 'journey' for data that starts its life in a raw, unclassified and possibly unstructured state. The pipeline progresses through various filters that might include categorization and analytics. It might be coupled with another data pipeline in some form of join, or some of it may be threaded and channelled elsewhere. Ultimately, the data pipe reaches its endpoint, which might be an application, another data service or some form of machine-based data ingestion point.
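As a rough illustration of that journey (the paths, column names and categorization rule below are invented for the example; PySpark is used simply because it is the lingua franca of this space), a minimal batch pipeline might read raw events, filter and categorize them, join them with a reference table and write the result to an endpoint:

```python
# A minimal batch pipeline sketch in PySpark: raw data in, filtered and
# categorized data joined with a reference table, results written to a sink.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("pipeline-sketch").getOrCreate()

raw = spark.read.json("s3://landing-zone/events/")          # raw, unclassified input
reference = spark.read.parquet("s3://curated/customers/")   # curated lookup data

cleaned = (
    raw.filter(F.col("event_type").isNotNull())             # drop malformed rows
       .withColumn("category",                              # simple categorization step
                   F.when(F.col("value") > 100, "high").otherwise("normal"))
)

enriched = cleaned.join(reference, on="customer_id", how="left")  # the 'join' stage

# Endpoint: an application table, another data service or an ML ingestion point
enriched.write.mode("overwrite").parquet("s3://warehouse/enriched_events/")
```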
Technology vendors who lean on this term are fond of laying claim to so-called end-to-end data pipelines; the phrase is meant to convey breadth and span.
Proving that this part of the industry is far from done or static, data platform company Databricks has open sourced its core declarative extract, transform and load (ETL) framework as Apache Spark Declarative Pipelines. Databricks CTO Matei Zaharia says that Spark Declarative Pipelines tackles one of the biggest challenges in data engineering: making it easy for data engineers to build and run reliable data pipelines that scale. He said end-to-end too, obviously.
Spark Declarative Pipelines provide a route to defining data pipelines for both batch (i.e. overnight) and streaming ETL workloads across any Apache Spark-supported data source. That means data sources including cloud storage, message buses, change data feeds and external systems. Zaharia calls it a 'battle-tested declarative framework' for building data pipelines, one that addresses complex pipeline authoring, manual operations overhead and siloed batch or streaming jobs.
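What a declarative pipeline definition looks like can be sketched roughly as follows. The module and decorator names used here are assumptions modelled on the Delta Live Tables lineage that Spark Declarative Pipelines grows out of, so the released Apache Spark API should be checked before treating this as gospel. The point is the shape: you declare what each table should contain and the framework handles scheduling, dependencies and the batch-versus-streaming plumbing.

```python
# Hypothetical sketch of a declarative pipeline definition. The module and
# decorator names (pyspark.pipelines, @dp.materialized_view, @dp.table) are
# assumptions; consult the Apache Spark documentation for the released API.
from pyspark import pipelines as dp
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

@dp.materialized_view
def cleaned_orders():
    # Batch source: declare *what* the table should contain, not how to run it
    return (spark.read.table("raw_orders")
                 .where(F.col("order_id").isNotNull()))

@dp.table
def order_events_stream():
    # Streaming source: same declarative shape, the framework manages the plumbing
    return (spark.readStream.format("kafka")
                 .option("kafka.bootstrap.servers", "broker:9092")
                 .option("subscribe", "orders")
                 .load())
```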
'Declarative pipelines hide the complexity of modern data engineering under a simple, intuitive programming model. As an engineering manager, I love the fact that my engineers can focus on what matters most to the business. It's exciting to see this level of innovation now being open sourced, making it more accessible,' said Jian Zhou, senior engineering manager for Navy Federal Credit Union.
A large part of the total data mechanization process is unsurprisingly focused on AI and the way we handle large language models and the data they churn. What this could mean for data mechanics is not just new toolsets, but new workflow methodologies that treat data differently. This is the view of Ken Exner, chief product officer at search and operational intelligence company Elastic.
'What IT teams need to do to prepare data for use by an LLM is focus on the retrieval and relevance problem, not the formatting problem. That's not where the real challenge lies,' said Exner. 'LLMs are already better at interpreting raw, unstructured data than any ETL or pipeline tool. The key is getting the right private data to LLMs, at the right time… and in a way that preserves context. This goes far beyond data pipelines and traditional ETL, it requires a system that can handle both structured and unstructured data, understands real-time context, respects user permissions, and enforces enterprise-grade security. It's one that makes internal data discoverable and usable – not just clean.'
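A minimal sketch of that retrieval-and-relevance idea (entirely illustrative: the documents, the naive term-overlap scoring and the call_llm stub are placeholders, not Elastic's API or any vendor implementation) looks less like an ETL job and more like permission-aware selection of context for a prompt:

```python
# Toy retrieval step: permission-aware selection of context for an LLM prompt.
# All names, documents and the call_llm() stub are illustrative placeholders.
from dataclasses import dataclass

@dataclass
class Doc:
    text: str
    allowed_groups: set

CORPUS = [
    Doc("Q3 churn analysis for enterprise accounts", {"analytics", "exec"}),
    Doc("Public product FAQ and pricing tiers", {"everyone"}),
    Doc("Draft board deck on acquisition targets", {"exec"}),
]

def retrieve(query: str, user_groups: set, k: int = 2) -> list[Doc]:
    """Keep only documents the user may see, then rank by naive term overlap."""
    visible = [d for d in CORPUS if d.allowed_groups & (user_groups | {"everyone"})]
    terms = set(query.lower().split())
    return sorted(visible,
                  key=lambda d: len(terms & set(d.text.lower().split())),
                  reverse=True)[:k]

def call_llm(prompt: str) -> str:          # placeholder for a real model call
    return f"[model response to {len(prompt)} chars of prompt]"

context = "\n".join(d.text for d in retrieve("enterprise churn", {"analytics"}))
print(call_llm(f"Answer using only this context:\n{context}\n\nQ: What drives churn?"))
```

Production systems replace the term-overlap scoring with proper relevance ranking (lexical, vector or hybrid search), but the permissions check and the context hand-off are the parts Exner is pointing at.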
For Exner, this is how organizations will successfully be able to grease the data mechanics needed to make generative AI happen: by unlocking the value of the mountains of (often siloed) private data that they already own, scattered across dozens (spoiler alert, it's actually often hundreds) of enterprise software systems.
As noted here, many of the mechanics playing out in data mechanics are aligned to the popularization of what the industry now agrees to call a data product.
As data now becomes a more tangible 'thing' in enterprise technology alongside servers, applications and maybe even keyboards, we can consider its use as more than just information; it has become a working component on the factory floor.