Azure Purview is a unified data governance service that helps you manage and govern your on-premises, multi-cloud, and software-as-a-service (SaaS) data. That’s the definition from the Microsoft site. Azure Purview caught my interest because there are not many PaaS/SaaS services that can effectively discover PII data at rest. PII stands for Personally Identifiable Information; I just want to make sure you and I are thinking of the same thing!
Azure Purview is very new, and I am sure its features and capabilities will grow over time. Today, I wanted to do a sanity check to verify whether the service is indeed able to discover PII data in Azure Files, Azure Blob and AWS S3. If you are in the financial domain, it is hard to avoid those storage services. My special focus was Azure Files, but I wanted to include S3 to verify Microsoft’s multi-cloud discovery claim! There is no better place to start than the Quick Start.
After you create a Purview account and go to the Overview tab, you will see the Open Purview Studio option. Keep both the Azure Portal and Purview Studio open side by side, as we will need to make configuration changes on both sides. No matter which resource you want to connect to for scanning, you will need some sort of credentials. So the first thing to do is create a new Key Vault connection under the Management/Credentials section. We will use Azure Key Vault to hold our secrets when Managed Identity can’t be used.
Purview has multiple options for connecting to Azure Blob, and Managed Identity is the most appropriate one when available. Azure Files does not have the Managed Identity option, so we will use the Account Key to access Azure Files. Purview allows you to create a credential that is stored in Key Vault. Follow Credentials for source authentication in Azure Purview to create the credential required to connect to Azure Files.
Once you have the credential, you can register data sources under Data Map > Sources > Register. We registered Azure Files, Azure Blob and AWS S3. You don’t need the credential to register a data source, but you do need it when running the scan. For S3, you will need to create an IAM role in AWS; you can find the related documentation at Amazon S3 Multi-Cloud Scanning Connector for Azure Purview.
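To give a feel for the AWS side, here is a minimal sketch of the trust policy such an IAM role typically carries. I am assuming the usual cross-account pattern, where Purview shows you a Microsoft account ID and an external ID when you create the credential. The function name and the placeholder values are hypothetical, so take the real values and the exact permissions from the connector documentation.

```python
import json


def build_trust_policy(microsoft_account_id: str, external_id: str) -> dict:
    """Build an IAM trust policy that lets another AWS account assume the role.

    Both arguments are placeholders here (assumption): copy the real values
    from the Purview credential screen, do not guess them.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                # The account that is allowed to assume this role.
                "Principal": {"AWS": f"arn:aws:iam::{microsoft_account_id}:root"},
                "Action": "sts:AssumeRole",
                # The external ID guards against the confused-deputy problem.
                "Condition": {"StringEquals": {"sts:ExternalId": external_id}},
            }
        ],
    }


if __name__ == "__main__":
    policy = build_trust_policy("<microsoft-account-id>", "<external-id>")
    print(json.dumps(policy, indent=2))
```

The role also needs read access to the buckets you want scanned; a read-only S3 policy is the natural choice, but check the connector documentation for what it actually requires.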
Okay, the Key Vault connection is set, the credentials are created, the resources are registered, and now we are ready to start scanning! Create a new scan for each registered resource. Purview will queue the scans, and you will see the results once the scans are done. If you encounter a “Failure to connect to data source” error message when trying to connect, it is likely that the key vault is not accessible, you have not granted IAM permission on the resource, or the storage account is not reachable due to firewall configuration.
Go to Data Catalog > Browse Assets > By Collection and select the collection. You will see the assets in the right pane, and you can narrow the results by selecting a classification.
I used the DLP Testing Data Generator at Venkon Cloud to generate test data and uploaded the files to Azure Files, Azure Blob and an AWS S3 bucket before running the scans. Purview did indeed find them. So we are able to use Azure Purview to discover PII assets in both Azure and AWS. That’s nice!
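If you would rather generate your own test data, here is a minimal sketch of how that could look. This is not how the DLP Testing Data Generator works; it is my own stand-in, and the column names, the name list and the example.com addresses are all made up. The card numbers pass the Luhn checksum but are otherwise random, and I have not verified which of these values Purview will classify.

```python
import csv
import io
import random


def luhn_check_digit(partial: str) -> str:
    """Return the Luhn check digit for a string of digits."""
    total = 0
    # Walk right to left; the rightmost digit of `partial` gets doubled first.
    for i, ch in enumerate(reversed(partial)):
        d = int(ch)
        if i % 2 == 0:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return str((10 - total % 10) % 10)


def luhn_valid(number: str) -> bool:
    """Check a full number (check digit included) against the Luhn checksum."""
    return luhn_check_digit(number[:-1]) == number[-1]


def fake_card(rng: random.Random) -> str:
    """16-digit number starting with 4, random body, valid Luhn check digit."""
    partial = "4" + "".join(str(rng.randint(0, 9)) for _ in range(14))
    return partial + luhn_check_digit(partial)


def fake_ssn(rng: random.Random) -> str:
    """SSN-shaped value; skips area 000, 666 and 900+, group 00, serial 0000."""
    area = rng.choice([n for n in range(1, 900) if n != 666])
    return f"{area:03d}-{rng.randint(1, 99):02d}-{rng.randint(1, 9999):04d}"


def make_csv(rows: int, seed: int = 0) -> str:
    """Return CSV text with a header line and `rows` lines of fake PII."""
    rng = random.Random(seed)  # seeded so the output is reproducible
    names = ["Alex Doe", "Sam Roe", "Pat Poe"]  # made-up names
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow(["name", "ssn", "credit_card", "email"])
    for i in range(rows):
        writer.writerow(
            [rng.choice(names), fake_ssn(rng), fake_card(rng), f"user{i}@example.com"]
        )
    return out.getvalue()


if __name__ == "__main__":
    # Write a small test file that can be uploaded to a share or bucket.
    with open("pii_test_data.csv", "w", newline="") as fh:
        fh.write(make_csv(100, seed=42))
```

Upload the resulting file to Azure Files, Azure Blob or S3 the same way as any other test file, then run the scans.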
Okay, Purview works, but here are some observations: it takes a long time to scan even small files, and I can’t tell how many instances of PII data were found in a given file. I have not done testing with private endpoints, but Purview does provide a private endpoint feature. Overall, I am happy to see that Microsoft has finally come out with a product to discover PII data across multiple clouds, and that it works.
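Since I can’t get per-file instance counts out of Purview, one workaround is to count PII-looking values locally before uploading, so you at least know what each test file contains. Here is a rough sketch. The regular expressions are my own simple approximations and have nothing to do with how Purview classifies data.

```python
import re
from collections import Counter

# Simple, assumed patterns; real classifiers are far more careful than this.
PATTERNS = {
    "ssn_like": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email_like": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "card_like": re.compile(r"\b\d{16}\b"),
}


def count_pii_like(text: str) -> Counter:
    """Count how many substrings of `text` match each pattern."""
    counts = Counter()
    for label, pattern in PATTERNS.items():
        counts[label] = len(pattern.findall(text))
    return counts


def count_pii_like_in_file(path: str) -> Counter:
    """Same as count_pii_like, but reads the text from a file."""
    with open(path, encoding="utf-8", errors="replace") as fh:
        return count_pii_like(fh.read())
```

Calling `count_pii_like_in_file` on a test file gives a count per pattern, which you can keep next to the file name and compare against the classifications Purview reports for the same asset.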
Note: The DLP Testing Data Generator that I used is no longer working. For your convenience, I have made those data files available here-