Sensitive Data Discovery: The Inventa Way

Published On: November 22, 2022
Categories: Blog

In this blog I’ll describe how Inventa does sensitive data discovery – how it discovers, merges, indexes and catalogs sensitive personal information across an enterprise for use in privacy and security applications. We think Inventa does sensitive data discovery better than any other solution on the market, and hopefully after reading this, you’ll understand why.

PII is like a Virus

For starters, Inventa doesn’t need to know where all your data is before we do sensitive data discovery. In fact, locating that data up front is one of the most difficult and time-consuming parts of most PII discovery efforts. The analogy I like to use is anti-virus – you wouldn’t tell your anti-virus system where to look for viruses, and you shouldn’t have to tell your PII data discovery solution where to look either.

Agentless Data Discovery

Additionally, Inventa is an extensible Docker container-based solution designed to scale out to wherever you may need to discover sensitive data. There are two components to the solution – a set of deployable sensors that connect to repositories via client connections, and a console manager where metadata about discovered data is organized and displayed. Inventa does not deploy agents.

Identifying Candidate Repositories

The first part of the ongoing process is to take a passive copy of available network traffic. This can be done using a network span or tap or a smart sniffing technology such as Gigamon or IXIA. Many organizations have this capability in place to support existing security and network operations tools. Complete network coverage is not required but covered network segments will provide better visibility.

From the network traffic, we build a list of candidate repositories for scheduled, periodic scanning. We identify all sorts of devices that may store or process sensitive personal data. This may very well include repositories previously thought not to contain sensitive data. From the network traffic, we can identify types of repositories, protocols used and in many cases the repository vendor information. A dynamic, virtual CMDB of sorts is created and maintained automatically and continuously by Inventa. This CMDB will be utilized later when Inventa scans data at rest.
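To make the idea concrete, here is a minimal sketch of how a candidate list could be derived from observed traffic. It is not Inventa's implementation: the `KNOWN_PORTS` table, the `build_candidates` function and the flow tuple layout are all assumptions made for illustration, and it guesses the repository type from the default server port only, whereas a real system would also inspect the protocol itself.

```python
# Illustrative only: guess repository types from well-known default ports.
KNOWN_PORTS = {
    445: "SMB file share",
    1433: "Microsoft SQL Server",
    3306: "MySQL",
    5432: "PostgreSQL",
    27017: "MongoDB",
}

def build_candidates(flows):
    """Build a candidate repository list from (src_ip, dst_ip, dst_port) flows.

    Returns a dict keyed by (dst_ip, dst_port) holding the guessed repository
    type and the set of client addresses seen talking to it.
    """
    candidates = {}
    for src_ip, dst_ip, dst_port in flows:
        repo_type = KNOWN_PORTS.get(dst_port)
        if repo_type is None:
            continue  # not a port this sketch recognizes as a data repository
        entry = candidates.setdefault(
            (dst_ip, dst_port), {"type": repo_type, "clients": set()}
        )
        entry["clients"].add(src_ip)
    return candidates
```

Keeping the set of clients alongside each candidate is a deliberate choice in this sketch: it records who talks to each repository, which is the kind of relationship information the inventory is meant to capture.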

Sensitive Personal Data in Motion

We also use the copy of network traffic to discover, identify and categorize sensitive personal data in motion as it traverses the network. This data can take many forms – it may be an application or database transaction, a web query, or simply a file transfer. When Inventa identifies content of interest, we collect all the relevant packets and reconstruct them to interrogate the message in the same way we discover and catalog data at rest. This allows Inventa to identify how specific data is shared between repositories, databases and applications in the organization. It also allows Inventa to identify sensitive data shared with external or internal third parties, which is critical for mitigating 3rd party risk.

Handling Encrypted Data

One question I often get is, “How do you handle encrypted data?” Inventa is not doing some sort of “man in the middle” attack on the traffic – if traffic is encrypted, then Inventa is limited to understanding header information. However, this header information still provides value to Inventa in terms of how repositories communicate and share information with one another. Additionally, Inventa can make use of decrypted ICAP traffic from proxy devices or raw packets from SSL termination devices if available in the environment.

Sensitive Personal Data at Rest

The third, and most data-intensive, part of the process is scanning data at rest across a wide range of repository types.

The bulk of the data that Inventa discovers and catalogs comes from regular, periodic, at-rest scanning of the candidate repositories identified by the network traffic, as well as any additional repositories configured manually. Permissible scanning windows can be established to scan structured and unstructured data in all sorts of repositories such as conventional file systems, databases and cloud data repositories.
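As a small illustration of a permissible scanning window, the sketch below checks whether a given time of day falls inside a configured window, including windows that run past midnight. The function name `in_scan_window` and the start-inclusive, end-exclusive convention are assumptions for this example, not a description of how Inventa schedules scans.

```python
from datetime import time

def in_scan_window(now, start, end):
    """Return True if `now` falls inside the window [start, end).

    Handles windows that cross midnight, such as 22:00 to 06:00.
    A window whose start equals its end is treated as empty.
    """
    if start <= end:
        return start <= now < end
    # The window wraps past midnight: late evening or early morning qualifies.
    return now >= start or now < end
```

A scheduler could call this before starting or resuming a scan, for example `in_scan_window(time(23, 30), time(22, 0), time(6, 0))` for an overnight window.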

Inventa connects natively to each repository and scans the data using entity extraction techniques. In the case of structured data repositories such as SQL databases, Inventa analyzes the structure of the database prior to the data content scan to understand how to most effectively and efficiently discover sensitive data in the database. In keeping with the idea that most customers don’t know where all their data is, Inventa does not need to know where in the database sensitive data resides prior to scanning the database. This is a significant manual cost for most data discovery efforts.
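The general pattern of reading the structure first and then sampling content can be sketched with Python's built-in `sqlite3` module. This is a simplified stand-in, not Inventa's scanner: the function `find_email_columns`, the single email regular expression and the `sample_size` parameter are assumptions for illustration, and real entity extraction covers far more than email addresses.

```python
import re
import sqlite3

# Deliberately simple pattern; real entity extraction is much broader.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def find_email_columns(conn, sample_size=100):
    """Return (table, column) pairs whose sampled values contain an email.

    Nothing about the schema is known in advance: tables and columns are
    discovered from the database itself, then each column is sampled.
    """
    hits = []
    tables = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' ORDER BY name"
    )]
    for table in tables:
        # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk).
        columns = [row[1] for row in conn.execute(f'PRAGMA table_info("{table}")')]
        for column in columns:
            rows = conn.execute(
                f'SELECT "{column}" FROM "{table}" '
                f'WHERE "{column}" IS NOT NULL LIMIT ?',
                (sample_size,),
            )
            if any(isinstance(v, str) and EMAIL_RE.search(v) for (v,) in rows):
                hits.append((table, column))
    return hits
```

Sampling a bounded number of rows per column keeps the first pass cheap; columns that produce hits can then be scanned in full.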

In the case of unstructured data, Inventa leverages natural language processing (NLP) and GPU-accelerated artificial intelligence and machine learning to understand the context of discovered sensitive data. This enables Inventa to identify additional categories of sensitive information and significantly improve the accuracy of discovery.

At this point in the process, we call the bits and pieces of sensitive personal data found in structured and unstructured files and repositories across the enterprise candidate PII.

Candidate PII in Context

The discovered data is now validated against known lists of individuals the organization has a relationship with. We call these lists “Data Assets,” but you can think of them as virtual groups of customers, employees, or any other subset of humans with a defined business relationship to the company.

For example, suppose Inventa has collected many pieces of sensitive personal information about John Smith from across the enterprise – from files, logs, transactions, databases, etc. At this point, Inventa can’t yet tell if John Smith is a customer, an employee, or neither. Additionally, there may be multiple John Smiths in the organization, and each of their sets of data needs to be associated appropriately. The regulatory and risk management requirements may differ significantly from group to group, so this context is also critical for making decisions about how to protect John’s information.

Inventa then combines all this data with each Data Asset to group data into relationships that the business understands. In this case our Data Asset is the list of 3rd Party Customers for this organization of which John Smith is a member. We then can confirm that all the bits of John Smith’s data are in fact PII that the organization cares about.

From Candidate PII to Managed PII

This verified data we call managed PII. There are no limitations to the number of Data Assets you can create, and in fact it would be quite normal for any given individual to be a member of more than one – for instance, John Smith, in addition to being a 3rd Party Customer, is also a Customer and perhaps a North American customer. These are simply three virtual groupings to which this John Smith belongs.
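A rough sketch of this validation step, under stated assumptions, might look like the following. The `classify` function, the record layout and the choice of a normalized email address as the unique identifier are all hypothetical. Matching on an identifier rather than a name is what keeps two different John Smiths apart in this sketch, and Data Asset member sets are assumed to already hold lowercase addresses.

```python
def classify(candidates, data_assets):
    """Split candidate PII records into managed and unmatched lists.

    candidates:  list of dicts, each with at least an "email" key.
    data_assets: dict mapping a Data Asset name to a set of member emails
                 (assumed to be lowercase already).
    A record becomes managed PII when its identifier appears in at least
    one Data Asset; every matching Data Asset name is attached to it.
    """
    managed, unmatched = [], []
    for record in candidates:
        key = record["email"].strip().lower()
        groups = sorted(
            name for name, members in data_assets.items() if key in members
        )
        if groups:
            managed.append({**record, "data_assets": groups})
        else:
            unmatched.append(record)
    return managed, unmatched
```

Because a record carries a list of Data Asset names rather than a single one, the same individual can belong to several virtual groupings at once.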

Inventa catalogs and organizes all this managed PII data including all the copies and partial copies of the data to be available for data subject access requests, vulnerability management, risk assessments, data protection programs, security audits, and other risk management and privacy purposes.

Once John’s data has been associated with the 3rd Party Customer grouping and confirmed as managed PII, Inventa organizes this data around this grouping, the individual, the repositories where it was found, and the type of data that was discovered. This data can be investigated from different starting points depending on the specific use case requirement.

For example, take the privacy use case of collecting data for a data subject access request response. Starting with a unique identifier for John Smith such as an ID or email address, we can see all the transactions, files, databases, repositories and other data assets John is associated with. This allows the privacy user to identify where all the copies and partial copies of John Smith’s data reside and with whom it has been shared. This normally time-consuming data gathering exercise is automatically maintained as part of the platform’s ongoing discovery and consolidation process.
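The lookup described above amounts to an index keyed by the individual. The sketch below shows one possible shape for it. The names `build_subject_index` and `dsar_report`, and the `subject_id`, `repository` and `data_type` fields, are invented for this example and do not reflect Inventa's actual data model.

```python
from collections import defaultdict

def build_subject_index(findings):
    """Index findings as subject_id -> repository -> set of data types."""
    index = defaultdict(lambda: defaultdict(set))
    for finding in findings:
        subject = finding["subject_id"]
        index[subject][finding["repository"]].add(finding["data_type"])
    return index

def dsar_report(index, subject_id):
    """Return {repository: sorted data types} for one individual.

    An unknown subject_id yields an empty report rather than an error.
    """
    repos = index.get(subject_id, {})
    return {repo: sorted(types) for repo, types in sorted(repos.items())}
```

The same findings could just as easily be indexed by repository or by data type, which is the starting point the security and risk use cases would typically want.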

Security and Risk use cases can be approached in a similar fashion usually starting from the repository, data asset, or data element point of view.

I hope this has helped you get a better understanding of how the Inventa platform continuously discovers, merges, indexes and catalogs sensitive personal data.