Data Obfuscation or "Masking"
Recent surveys have indicated that in many cases sensitive information has been compromised not on production computers, but on development, test or training databases that were preloaded with production data, including the sensitive data elements. Often, the security of these environments does not match the level of data protection afforded to the production system, so they become easy targets for malicious individuals who are looking for sensitive data.
Why would anyone put production data on development and test computers?
Setting up a copy of production data for testing is much easier than creating test data from scratch. Here are a few reasons why:
- Volume improves the quality and thoroughness of application testing. If you generate one hundred made-up test records, you might be able to test many of the system’s functions and conditions, but your testing may not be as thorough as if you were testing against the hundreds of thousands of records that the production database may hold.
- Creating realistic values for each data element can be complex and time-consuming. Test values may need to fall within specific ranges, include check characters that are used to verify the value, or satisfy other restrictions.
- Information in one table may need to link to information in other tables or even other databases, so any fabricated key fields in the test database must be duplicated across all tables and databases that use those data elements.
But even with the above benefits, using production data for development, testing or training purposes is a bad idea, because it places sensitive data in environments that are less well protected than production.
What is data masking and how can it help?
Data masking is a mechanism that creates a copy of a database in which the values of potentially sensitive data elements, such as names, social security numbers, salaries and grades, are altered so that the original values are no longer available in the database copy and cannot be determined by applying any formula to the masked value.
Data masking tools provide a variety of data masking methods. The following list includes only a small fraction of the masking methods that can be defined for each field to be masked. Basically, you can replace a data element that you want to mask with:
- The same value for all instances
- Purely random numbers or characters for each instance of a data element
- Random values that are restricted by some formula (e.g., not every nine-digit number is a valid SSN)
- A randomly selected value from a table of values
- A value built from randomly selected values from other tables. For example, a new name field could be generated by taking a random first name from a table and combining it with a random last name to create a new name for each instance
- A value based upon one or more data elements that have already been masked. For example, the masked name value may determine the gender value, and the masked city and state values may determine the zip code value
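To make the methods in this list concrete, here is a minimal Python sketch of one masking function per method. It is an illustration only, not how any particular masking product works. The lookup tables, the field names (name, gender, ssn, department, badge) and the function names are all hypothetical, and the validity rule in mask_ssn (no area 000, 666 or 900-999, no group 00, no serial 0000) is an assumed stand-in for whatever formula a given tool applies.

```python
import random
import string

# Hypothetical lookup tables; a real tool would use far larger ones.
FIRST_NAMES = {"Alice": "F", "Maria": "F", "James": "M", "Robert": "M"}
LAST_NAMES = ["Smith", "Garcia", "Nguyen", "Okafor"]


def mask_constant(_value, replacement="XXXX"):
    """The same value for all instances."""
    return replacement


def mask_random_chars(value, rng):
    """Purely random characters; only the original length is kept."""
    return "".join(rng.choice(string.ascii_uppercase) for _ in value)


def mask_ssn(rng):
    """Random value restricted by a rule (assumed SSN validity rule)."""
    area = rng.choice([n for n in range(1, 900) if n != 666])
    group = rng.randint(1, 99)
    serial = rng.randint(1, 9999)
    return f"{area:03d}-{group:02d}-{serial:04d}"


def mask_name(rng):
    """Value built from randomly selected entries in two lookup tables."""
    first = rng.choice(sorted(FIRST_NAMES))
    last = rng.choice(LAST_NAMES)
    return first, last


def mask_gender(masked_first_name):
    """Value derived from an element that has already been masked."""
    return FIRST_NAMES[masked_first_name]


def mask_record(record, rng):
    """Apply one masking method to each field of a hypothetical record."""
    first, last = mask_name(rng)
    return {
        "name": f"{first} {last}",
        "gender": mask_gender(first),
        "ssn": mask_ssn(rng),
        "department": mask_constant(record["department"]),
        "badge": mask_random_chars(record["badge"], rng),
    }


# A seeded generator keeps the demo repeatable. A real masking run would
# want an unpredictable source such as random.SystemRandom().
rng = random.Random(42)
original = {
    "name": "John Doe",
    "gender": "M",
    "ssn": "123-45-6789",
    "department": "Payroll",
    "badge": "AB123",
}
print(mask_record(original, rng))
```

None of the functions except mask_random_chars looks at the original value (and that one uses only its length), so nothing in the output can be fed through a formula to recover the input.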
Data masking software keeps track of relationships
If you are attempting to mask a data element that can be found in multiple tables across the database, how would you ensure that the masked data element in one table matches the corresponding data element in another?
This is the real power of data masking products. When the database is defined to the data masking software, the software executes a discovery process where it builds a reference table of data element relationships. In a simple example, let’s assume that the tables in a database are linked by social security number, a data element that we want to mask. The discovery process would find instances of social security number across all of the tables in the database. Then, when the social security number 123-45-6789 is masked to become 993-11-5656, the data masking software will replace every instance of social security number 123-45-6789 in the database with 993-11-5656. This way the data element relationships are preserved.
In cases where discovery misses a data element relationship, for example because a column holds social security numbers but its name does not suggest it, the program allows the data masking administrator to specify additional data element relationships manually.
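The relationship-preserving replacement described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the behavior of any specific product: the database is modeled as in-memory lists of dictionaries, the table and column names are hypothetical, and the SSN_COLUMNS list stands in for the reference table that discovery would build, with the last entry playing the role of a relationship added manually by the administrator. The random_ssn helper produces format-only placeholder values.

```python
import random

# Hypothetical reference table of (table, column) pairs that hold the same
# data element. Discovery would normally populate this list.
SSN_COLUMNS = [
    ("employees", "ssn"),
    ("payroll", "ssn"),
    ("benefits", "member_id"),  # manually added: the name hides its content
]


def random_ssn(rng):
    """Format-only placeholder generator (assumption, not a validity rule)."""
    return f"{rng.randint(1, 899):03d}-{rng.randint(1, 99):02d}-{rng.randint(1, 9999):04d}"


def mask_linked_columns(database, columns, rng):
    """Replace each distinct original value with one masked value everywhere."""
    mapping = {}  # original -> masked, kept only for the duration of the run
    used = set()  # masked values already handed out
    for table, column in columns:
        for row in database[table]:
            original = row[column]
            if original not in mapping:
                candidate = random_ssn(rng)
                # Avoid collisions so two people never share a masked value.
                while candidate in used or candidate == original:
                    candidate = random_ssn(rng)
                mapping[original] = candidate
                used.add(candidate)
            row[column] = mapping[original]
    # The mapping is deliberately not returned or stored: keeping it would
    # make the masking reversible.
    return len(mapping)


database = {
    "employees": [
        {"ssn": "123-45-6789", "name": "John Doe"},
        {"ssn": "987-65-4321", "name": "Jane Roe"},
    ],
    "payroll": [
        {"ssn": "123-45-6789", "salary": 52000},
        {"ssn": "987-65-4321", "salary": 61000},
    ],
    "benefits": [{"member_id": "123-45-6789", "plan": "PPO"}],
}

distinct = mask_linked_columns(database, SSN_COLUMNS, random.Random())
print(distinct, "distinct values masked")
print(database)
```

After the run, rows that referred to the same person still share a value across all three tables, so joins on that column behave as they did before masking.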
In a nutshell ...
Data masking provides the following information security, application development, testing and training benefits:
- Original data values for masked fields are not held in the database
- The data masking process cannot be reversed, so sensitive data is protected even if the masked database is stolen
- Developers and testers can test against large numbers of database records with all of the complexity of the production environment
- Trainers can generate sample transactions that mimic existing, complex relationships without having to create them manually
- The masked database can be refreshed over and over again by passing the production database through the data masking program
- As new relationships are built into the production database, they can easily be carried over to the masked database
- Masked databases can include as many records as the production database or only a fraction to keep the test bed more manageable