Latest news with #dataextraction

Turning Data Into Insight With Document Processing Start-Ups

Forbes

30-07-2025

Extracting high-quality data from large volumes of documents remains highly challenging.

Seemingly simple problems often defy simple solutions. Companies worldwide have almost endless amounts of potentially valuable data and intelligence locked up in documents such as Word files, PDFs, spreadsheets and even printed papers; mining that data for actionable insight should deliver huge benefits, right?

In theory yes, but in practice, automating accurate data extraction has proved hugely challenging, with conventional technologies struggling to interrogate complex documents in different formats. Now, however, generative artificial intelligence (GenAI) is promising to come to the rescue, prompting a boom in the intelligent document processing market and a rash of new products and services from small start-ups and big tech.

Research from Fortune Business Insights puts the value of this market at around $10.6 billion in 2025 but predicts it will be worth $66.7 billion by 2032; that's growth of more than 30% a year. 'Intelligent document processing refers to a workflow automation technology that mines, reads, scans and categorises data from documents to enhance business process automation,' explain Fortune's analysts. 'It combines optical character recognition (OCR) with AI machine learning algorithms to automate the handling of complex documents in different formats.'

San Francisco-based start-up Retab is one new entrant hoping to take advantage of advances in the field. The company, which is today announcing a $3.5 million pre-seed round, has developed a new platform that enables both developers and non-technical users such as analysts to automate the process of getting at the data they need. Retab's technology works alongside large language models from providers such as OpenAI, Google and Anthropic, guiding how they extract data from the user's documents so that the results are as accurate as possible.

'Until now, it's taken teams of developers months to develop tools of sufficient quality,' says Louis de Benoist, co-founder and CEO of the company. 'People keep building demos that look like magic but break the moment you put them into production; Retab wraps the best models in a layer of logic that actually makes them usable with error handling and structured outputs.'

Retab started out in the logistics sector, designing solutions to work with the wide range of variable documents that the industry generates, from bills of goods to invoices. Today, the company sees finance and healthcare as additional sectors where huge volumes of data need to be processed and where its solutions might therefore appeal.

Today's pre-seed round is led by early-stage funds including VentureFriends, Kima Ventures and K5 Global, as well as a number of angel investors. Those angels include Florian Douetteau, co-founder and CEO of Dataiku, who argues: 'The AI-fication of the economy depends on the capability to convert operations based on millions of documents into verified, structured data that autonomous systems can utilise; this process hinges on quality control, cost efficiency, and rapid implementation.'

It's an argument that other players in an increasingly competitive market share. Other notable companies in the sector include ABBYY, Appian, Rossum and UiPath, while IBM and Microsoft also offer solutions. However, de Benoist is convinced that Retab can compete in this space, particularly with a software package aimed at analysts in businesses who lack specialist coding knowledge. 'The idea is to have a self-serve product that really appeals to a broad audience,' he says. 'It's the analysts who really understand in detail what they need from this data that need serving.'

Fortune Business Insights' research suggests there should be plenty of new demand to go round. It sees finance and accounting as the sector's biggest customer but also points to procurement and HR as potentially significant buyers.
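The market forecast quoted in this article (around $10.6 billion in 2025 rising to $66.7 billion by 2032) can be sanity-checked with a compound annual growth rate calculation. The sketch below is only an illustration: the function name `cagr` is made up for this example, and it assumes the 2025 to 2032 span is treated as seven compounding years, which the article does not state explicitly.

```python
def cagr(start_value: float, end_value: float, years: int) -> float:
    """Return the compound annual growth rate as a fraction (0.30 means 30% a year)."""
    return (end_value / start_value) ** (1 / years) - 1


# Figures quoted in the article, in billions of US dollars.
# Assumption: 2025 -> 2032 counts as seven compounding periods.
rate = cagr(10.6, 66.7, 2032 - 2025)
print(f"Implied growth: {rate:.1%} a year")
```

Under that assumption the implied rate comes out just over 30% a year, which is in line with the figure given in the article.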

Chinese authorities are using a new tool to hack seized phones and extract data

TechCrunch

16-07-2025

Security researchers say Chinese authorities are using a new type of malware to extract data from seized phones, allowing them to obtain text messages (including from chat apps such as Signal), images, location histories, audio recordings, contacts, and more.

On Wednesday, mobile cybersecurity company Lookout published a new report, shared exclusively with TechCrunch, detailing a hacking tool called Massistant, which the company said was developed by Chinese tech giant Xiamen Meiya Pico. Massistant, according to Lookout, is Android software used for the forensic extraction of data from mobile phones, meaning the authorities using it need to have physical access to those devices.

While Lookout doesn't know for sure which Chinese police agencies are using the tool, its use is assumed to be widespread, which means both Chinese residents and travelers to China should be aware of the tool's existence and the risks it poses.

'It's a big concern. I think anybody who's traveling in the region needs to be aware that the device that they bring into the country could very well be confiscated and anything that's on it could be collected,' Kristina Balaam, a researcher at Lookout who analyzed the malware, told TechCrunch ahead of the report's release. 'I think it's something everybody should be aware of if they're traveling in the region.'

Balaam found several posts on local Chinese forums where people complained about finding the malware installed on their devices after interactions with the police. 'It seems to be pretty broadly used, especially from what I've seen in the rumblings on these Chinese forums,' said Balaam.

The malware must be planted on an unlocked device and works in tandem with a hardware tower connected to a desktop computer, according to a description and pictures of the system on Xiamen Meiya Pico's website. Balaam said Lookout couldn't analyze the desktop component, nor could the researchers find a version of the malware compatible with Apple devices. In an illustration on its website, Xiamen Meiya Pico shows iPhones connected to its forensic hardware device, suggesting the company may have an iOS version of Massistant designed to extract data from Apple devices.

Police do not need sophisticated techniques to use Massistant, such as zero-days (flaws in software or hardware that have not yet been disclosed to the vendor), as 'people just hand over their phones,' said Balaam, based on what she's read on those Chinese forums.

Since at least 2024, China's state security police have had legal powers to search through phones and computers without needing a warrant or the existence of an active criminal investigation. 'If somebody is moving through a border checkpoint and their device is confiscated, they have to grant access to it,' said Balaam. 'I don't think we see any real exploits from lawful intercept tooling space just because they don't need to.'

[Image: A screenshot of the Massistant mobile forensic tool's hardware, taken from Xiamen Meiya Pico's official Chinese website. Image credits: Xiamen Meiya Pico]

The good news, per Balaam, is that Massistant leaves evidence of its compromise on the seized device, meaning users can potentially identify and delete the malware. The hacking tool either appears as an app, or can be found and deleted using more sophisticated tools such as the Android Debug Bridge, a command-line tool that lets a user connect to a device through their computer.

The bad news is that by the time Massistant has been installed, the damage is done, and authorities already have the person's data.

According to Lookout, Massistant is the successor of a similar mobile forensic tool, also made by Xiamen Meiya Pico, called MSSocket, which security researchers analyzed in 2019. Xiamen Meiya Pico reportedly has a 40% share of the digital forensics market in China, and was sanctioned by the U.S. government in 2021 for its role in supplying its technology to the Chinese government. The company did not respond to TechCrunch's request for comment.

Balaam said that Massistant is only one of a large number of spyware and malware tools made by Chinese surveillance tech makers, in what she called 'a big ecosystem.' The researcher said that the company tracks at least 15 different malware families in China.
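As a rough illustration of the Android Debug Bridge check mentioned in this article, the sketch below compares a device's installed packages against a list of known indicators. It is not Lookout's method: the helper names are invented for this example, the package name `com.example.suspicious` is a placeholder rather than a real indicator (real ones would have to come from Lookout's report), and the script assumes `adb` is on the PATH with USB debugging enabled on the device.

```python
import subprocess


def list_packages(adb_output: str) -> set[str]:
    """Parse the output of `adb shell pm list packages` into a set of package names."""
    prefix = "package:"
    return {
        line.strip()[len(prefix):]
        for line in adb_output.splitlines()
        if line.strip().startswith(prefix)
    }


def find_suspects(adb_output: str, indicators: set[str]) -> list[str]:
    """Return installed packages that match the supplied indicator names."""
    return sorted(list_packages(adb_output) & indicators)


def read_device_packages() -> str:
    """Ask a connected device for its package list (needs adb and USB debugging)."""
    result = subprocess.run(
        ["adb", "shell", "pm", "list", "packages"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


# Offline demonstration with sample output; the indicator is a placeholder.
sample_output = "package:com.android.settings\npackage:com.example.suspicious\n"
suspects = find_suspects(sample_output, {"com.example.suspicious"})
print(suspects)
```

On a real device, `read_device_packages()` would replace the sample string, and a flagged package could then be removed with `adb uninstall` followed by the package name. As the article notes, removal does not undo any extraction that has already taken place.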

Unlock the Power of Data Extraction with Gemini CLI and MCP Servers

Geeky Gadgets

01-07-2025

What if you could seamlessly integrate a powerful command-line tool with a server designed to handle complex data extraction workflows? Imagine automating the collection of structured data from platforms like LinkedIn or Amazon, all while maintaining precision, compliance, and efficiency. This is what combining Gemini CLI with a Model Context Protocol (MCP) server offers. Whether you're a data scientist navigating intricate scraping scenarios or a business professional seeking actionable insights, this pairing offers a streamlined approach to managing your data extraction processes. But as with any sophisticated system, the key lies in understanding how to configure and optimize these tools.

In this deep dive, Prompt Engineering explores the step-by-step process of integrating Gemini CLI with an MCP server, using Bright Data as the example. You'll see how to configure essential settings like API tokens and rate limits, use advanced features such as structured queries and browser APIs, and troubleshoot common problems. Along the way, we'll highlight how this integration simplifies data collection and helps you extract actionable insights from complex datasets.

Configuring Gemini CLI for MCP Servers

To integrate Gemini CLI with an MCP server, proper configuration is essential. The process begins with creating a configuration file, which serves as the central repository for your API tokens, zones, and rate limits. This configuration governs communication between Gemini CLI and the MCP server.

  • Generate API tokens: Obtain API tokens from your MCP server account to enable secure authentication.
  • Set rate limits: Define rate limits to prevent overloading the server and maintain compliance with usage policies.
  • Define zones: Specify zones to outline the scope and focus of your data extraction activities.

After completing these steps, restart Gemini CLI to apply the updated settings.

Maximizing Efficiency with Bright Data MCP Server

Bright Data's MCP server is widely recognized for its web scraping capabilities and toolset. When integrated with Gemini CLI, it enables automated data collection from platforms such as LinkedIn, Amazon, and YouTube. Its specialized features are designed to address complex scraping scenarios.

  • Web unlocker: Overcomes CAPTCHA challenges and other access restrictions, ensuring uninterrupted data collection.
  • Browser APIs: Simulate user interactions, such as scrolling or clicking, to enable dynamic and comprehensive data extraction.

These tools are particularly effective for gathering structured data, such as product specifications, user profiles, or video metadata, so that the extracted data arrives organized and ready for analysis.
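The configuration steps described above (token, rate limit, zones) can be sketched as a small script that assembles a settings structure and prints it as JSON. This is an illustration under stated assumptions, not the article's exact file: the `mcpServers`, `command`, `args` and `env` layout, the `@brightdata/mcp` package name, and the environment variable names `API_TOKEN`, `RATE_LIMIT`, `WEB_UNLOCKER_ZONE` and `BROWSER_ZONE` should all be checked against the current Gemini CLI and Bright Data documentation, and the values shown are placeholders.

```python
import json


def build_settings(api_token: str, rate_limit: str,
                   web_unlocker_zone: str, browser_zone: str) -> dict:
    """Assemble an MCP server entry; every key name here is an assumption to verify."""
    return {
        "mcpServers": {
            "brightdata": {                    # arbitrary label for this server entry
                "command": "npx",              # assumed launcher for the server package
                "args": ["@brightdata/mcp"],   # assumed package name
                "env": {
                    "API_TOKEN": api_token,                  # token from the provider account
                    "RATE_LIMIT": rate_limit,                # placeholder format, check the docs
                    "WEB_UNLOCKER_ZONE": web_unlocker_zone,  # zone for the web unlocker
                    "BROWSER_ZONE": browser_zone,            # zone for the browser API
                },
            }
        }
    }


settings = build_settings("<your-api-token>", "100/1h",
                          "<unlocker-zone>", "<browser-zone>")
print(json.dumps(settings, indent=2))
```

The printed JSON would be merged into the Gemini CLI settings file by hand; keeping the token in a placeholder here avoids committing a real credential, and the CLI needs a restart afterwards, as the article notes.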
Core Features of MCP Servers

MCP servers, including Bright Data, offer a variety of features designed to optimize data extraction workflows and handle diverse data collection tasks.

  • Structured queries: Enable precise and targeted data requests, reducing unnecessary processing and improving accuracy.
  • URL-based inputs: Focus on specific web pages or sections to streamline data collection efforts.
  • Error-handling tools: Address common issues such as timeouts or access restrictions, ensuring reliable operations.
  • Permission management: Maintain compliance with platform policies and legal requirements.

For example, structured queries can be used to extract detailed information from LinkedIn profiles or YouTube videos, while permission management tools help keep your activities within acceptable boundaries.

Overcoming Common Challenges

While Gemini CLI and MCP servers are powerful tools, users may encounter challenges during setup or operation. Common issues include incorrect configuration of the configuration file or difficulties disabling default tools, such as Google search, within Gemini CLI. Addressing these usually involves revisiting configuration files or consulting official documentation.

If persistent issues arise, consider running the Bright Data MCP server on a cloud desktop environment. This approach provides a stable and controlled platform for data extraction tasks and reduces the likelihood of disruptions.

Enhancing Operations with Cloud Desktop Integration

Setting up the Bright Data MCP server on a cloud desktop offers several advantages, particularly for users managing complex or large-scale data extraction projects. The process involves editing the configuration file to include your API token and other critical settings.

  • Secure configuration storage: Safeguard sensitive settings and access them from any location.
  • Controlled environment: Execute complex scraping tasks without impacting the performance of your local system.
  • Scalability: Easily expand operations to handle larger datasets or more intricate workflows.

A cloud desktop gives your data extraction activities a reliable and scalable foundation, ensuring consistent performance and security.

The Evolving Potential of Gemini CLI

As an open source tool, Gemini CLI continues to benefit from ongoing development and community contributions. Regular updates introduce new features, enhance compatibility with MCP servers, and improve overall functionality. For professionals seeking efficient and scalable data extraction solutions, Gemini CLI remains a valuable and adaptable resource, and following its updates helps keep your workflows current.

Media Credit: Prompt Engineering

Leads-Sniper.com Introduces Updated Web-Scraping Suite for Data-Driven Lead Generation

Associated Press

27-06-2025

Leads-Sniper.com has released an updated set of web-scraping tools designed to help organizations collect publicly available business information from widely used online sources. The new release focuses on streamlining data extraction from Google Maps, Google Search, Yellow Pages directories, and business domains, giving sales and research teams a structured way to build prospect lists without manual copying.

Key Components of the Updated Suite

Focus on Responsible Data Use

Leads-Sniper.com emphasises that its tools are intended for collecting information already available in the public domain. The platform encourages users to follow applicable data-privacy regulations and best practices when storing, processing, or contacting individuals and businesses.

'Our goal is to reduce the time teams spend on repetitive data collection so they can concentrate on higher-value tasks such as analysis and relationship building,' a Leads-Sniper spokesperson said. 'These updates reflect feedback from users who need reliable, structured data delivered in an efficient manner.'

Supporting a Range of Business Functions

Organisations employ web-scraped data for tasks such as territory planning, competitive mapping, supplier research, and targeted outreach. By automating extraction from familiar online properties, Leads-Sniper.com aims to give small firms and larger enterprises a consistent, reproducible workflow for assembling lead lists and monitoring market segments.

About Leads-Sniper.com

Leads-Sniper.com develops web-scraping software that helps businesses gather publicly available contact and market information for sales, research, and analysis purposes. The company's tools retrieve data from search engines, online directories, and domain-based sources, providing users with structured output for downstream use in CRM, analytics, or marketing platforms.

Media Contact
Company Name: Leads Sniper
Contact Person: Sabrina Garret
Country: Hong Kong
Source: PRBoost

Cellebrite To Acquire Phone Forensics Startup Corellium For $200 Million

Forbes

05-06-2025

Cellebrite and Corellium are providing new tools to police departments and intelligence agencies for getting data from cellphones.

When trying to find a vulnerability in Apple iPhones or Android devices, many cybersecurity researchers now use a tool from Florida-based startup Corellium. Rather than risk breaking a physical device when they hack it, which they'd subsequently have to replace, they can create a virtual version of the phone in Corellium.

Now, Cellebrite, one of the largest providers of phone forensics tools, has agreed to acquire Corellium for $200 million, a major merger that promises to give law enforcement unprecedented tooling for extracting data from seized electronics.

The deal is a coup for Corellium founder and CEO Chris Wade, who in the last five years alone settled a major copyright lawsuit from Apple and received a pardon from President Trump for his role in providing proxy servers to a pair of spammers who were convicted of cybercrimes back in the mid-2000s. Wade avoided prison time, doing undercover work for the Department of Justice.

Now, Wade will start a new chapter as the chief technology officer at Cellebrite, which is listed on the Nasdaq with a $3 billion market cap and posted over $400 million in revenue in 2024. The $200 million deal will consist of $150 million in cash, $20 million of restricted stock, and another $30 million in cash if certain unspecified performance milestones are hit over the next two years.

'We've been a customer of Corellium for many years,' said Cellebrite CEO Tom Hogan. As soon as he learned Wade was looking for a buyer earlier this year, Cellebrite 'jumped on that immediately and pursued being their ultimate home.'

Wade told Forbes he was excited to work for a company whose technology is used in 1.5 million law enforcement investigations every year. 'That's a phenomenal statistic,' Wade said. 'Imagine the real world impact of that. That was something I wanted to be involved with.'

Cellebrite and Corellium make for a good fit. Cellebrite offers a range of tools that come with the promise of accessing data on phones and PCs even when they're locked; its largest federal customer is Immigration and Customs Enforcement (ICE), whose biggest order was $9.6 million in August last year. However, with devices like the iPhone continually adding layers of security, Cellebrite and rivals like Atlanta-based Grayshift have to find operating system flaws that can be exploited to bypass such barriers and get at data.

Corellium's software makes finding those weaknesses easier by allowing the user to quickly spin up any make or model of a device within a PC app and test a given hack. For law enforcement, that means a cheaper and more efficient way to find exploits that could get them crucial evidence in an investigation.

Corellium's software is also used by all manner of defensive and offensive cyber researchers probing software for vulnerabilities. While being sold directly to police agencies, Corellium will continue to be developed and sold to private customers like banking giant Santander and defense contractor L3Harris.

The merged business also plans to debut a new beta product called Mirror that enables police to make a virtual version of a seized device and all the data that's on it. Wade thinks it'll help prosecutors show a jury exactly what's on a criminal's phone, presenting more compelling evidence than screenshots from technical-looking forensic tools.

There's another benefit to Corellium's virtual devices. Sometimes forensics tools like Cellebrite's aren't compatible with certain mobile apps, meaning they won't retrieve data from them; Mirror will allow cops to look through those apps, says Wade, effectively giving them more complete access to what's on the device.

Even before the deal, Cellebrite and Corellium had already been collaborating on an AI-powered service to detect government-made spyware on cellphones. The AI will look at a replicated version of a phone's operating system and identify 'deviations or any execution of foreign code on the device,' Wade said. 'This is something that's never been done before,' he said. 'It'll make it much easier to track down these kinds of state sponsored malware attacks.'

While Wade's connections to Trump and the DOJ might turn heads, Cellebrite CEO Tom Hogan said he's not concerned about the optics. 'The fact that the United States government, the FBI and Department of Justice leaned on him to help secure the United States, that's a pretty bold testimonial,' Hogan added.
