An In-Depth Guide to Data Masking

By Erin,

Applications have become more advanced than ever. They’re not just used to get or share information. Applications today are capable of doing almost everything—ordering products, writing an essay for you, completing financial transactions, and much more. To be able to do all this, applications need to deal with sensitive data.

Sensitive data attracts malicious actors who try to steal it. So, when sensitive data is involved, you need to take care of data security. One simple yet very effective technique is data masking. In this post, we’ll discuss why data masking is important and then see some examples of using it.

What is Data Masking?

Data masking hides sensitive data by replacing it, or a part of it, with fake data. The idea is to maintain the structure of the actual data without exposing the real values.

For example, many applications store users’ contact numbers. If a user forgets their password, the application sends a one-time password (OTP) to the user’s phone, and the user can then reset their password. To get an OTP, the user provides their username or email address and verifies the number to which the OTP will be sent. But here’s the problem: If user Tony has forgotten his password, he would provide his email address, confirm the phone number, and get an OTP. But what if some other user, say Steve, knew Tony’s email address? If Steve used the forgot password option and entered Tony’s email address, then when the app tries to confirm Tony’s phone number, Steve would see the number.

This is a privacy breach. To avoid such situations, we can show a part of the actual data. If the phone number has 10 digits, we can display the last 3 digits and mask the first 7: XXXXXXX452. This way, Tony can verify his phone number, but Steve can’t get it.
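As a minimal sketch of this kind of partial masking in Python, the helper below hides everything except the last few digits. The function name `mask_phone` and the sample number are assumptions made for illustration and don't come from any particular application.

```python
def mask_phone(number, visible=3, mask_char="X"):
    """Hide all characters except the last `visible` ones."""
    # Number of leading characters to hide; never negative.
    hidden = max(len(number) - max(visible, 0), 0)
    return mask_char * hidden + number[hidden:]


# Hypothetical 10-digit number used only for this example.
print(mask_phone("9876543452"))  # XXXXXXX452
```

Note that a number shorter than `visible` would be shown in full, so a real application would probably also validate the input length first.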

Data redaction is another technique for hiding sensitive data. Redaction means blacking out all the sensitive data. Here’s a screenshot from Google Cloud that shows what data redaction looks like.

Why is Data Masking Important?

In a word, data masking is important for data security. But let’s dive a little deeper and understand why it’s beneficial.

Insider Threat

An insider threat comes from inside an organization. It can be an employee with malicious intent or an employee who’s not following security practices. Data is stored in databases, and some employees of the company will have access to all data. For example, customer support staff may have access to users’ credit card numbers and SSNs. Customer service employees might need access to this database to help customers, but it isn’t safe to let them view all of a customer’s data, so you need to control what they can see and what is hidden. In such cases, data masking can help secure customers’ data and protect privacy.
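One way to picture this control is a small per-role view of a customer record. The sketch below is only an illustration: the role names, field names, and the `view_for_role` helper are all hypothetical, and a real system would enforce this in the database or the API layer rather than in a dictionary.

```python
# Hypothetical roles and the fields each one may see in clear text.
VISIBLE_FIELDS = {
    "support": {"name", "email"},
    "billing": {"name", "email", "card_number"},
}


def view_for_role(record, role):
    """Return a copy of `record` with every field the role may not see fully masked."""
    allowed = VISIBLE_FIELDS.get(role, set())  # unknown roles get everything masked
    return {
        field: value if field in allowed else "X" * len(value)
        for field, value in record.items()
    }


# Made-up customer record for the example.
customer = {
    "name": "Tony",
    "email": "tony@example.com",
    "card_number": "3542-7583-7228-3788",
    "ssn": "000-12-3456",
}
print(view_for_role(customer, "support"))
```

The original record is left untouched, so the same data can be rendered differently for each role.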

Data Breach

If a malicious actor gets access to a system’s database, they could copy the data to a different location and demand a ransom. Data masking helps here because masked data is useless to whoever gets hold of it. Data masking will not prevent a data breach, but it will limit the effects of one.

Production Database Protection

In this case, the malicious actor need not get access to the database itself. They could get access to data on the client side of an application or while data is moving from the server to the client. The example of Tony and Steve mentioned above demonstrates how poor practices can lead to illegitimate data access.

Compliance

Many data security regulations require sensitive data to be protected, and data masking is one way to meet those requirements. The General Data Protection Regulation (GDPR) is one example. Even if there’s no data incident at your organization, if you aren’t compliant with regulations, you might face legal issues and end up paying fines. Data masking helps you avoid this situation.

Now that you understand why data masking is important, let’s see where we commonly use it.

When Do We Use Data Masking?

We use data masking in two main cases: testing and production.

Test Data Management

To build a robust application, developers should know what kind of data they’re dealing with, its structure, and how the application behaves when dealing with real data. But neither developers nor testers need to see the actual sensitive data. They both can work with data that is similar to the actual data. That’s where data masking comes in. By using masked data, developers and testers can build and test products without worrying about privacy issues. They can also cover all the cases to build and test the product without risking a privacy breach.
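To make "data that is similar to the actual data" concrete, here is a rough sketch that swaps every digit and letter for a random one of the same kind while leaving separators alone. The `scramble` helper and the sample row are assumptions for illustration. Dedicated test data tools do much more than this.

```python
import random
import string


def scramble(value, rng):
    """Swap each digit or letter for a random one of the same kind; keep the rest."""
    out = []
    for ch in value:
        if ch.isdigit():
            out.append(rng.choice(string.digits))
        elif ch.isalpha():
            pool = string.ascii_uppercase if ch.isupper() else string.ascii_lowercase
            out.append(rng.choice(pool))
        else:
            out.append(ch)  # separators such as '-' keep the format intact
    return "".join(out)


rng = random.Random(42)  # fixed seed so generated fixtures are repeatable
production_row = {"Name": "Tony", "Card Number": "3542-7583-7228-3788"}
test_row = {key: scramble(value, rng) for key, value in production_row.items()}
print(test_row)
```

The result has the same length, casing, and hyphen positions as the real row, which is usually what the code under test cares about. Because characters are drawn at random, an individual character can occasionally match the original by chance.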

Most of us use automated testing tools today because they’re fast and accurate. To use those tools, we don’t need to create a separate database of masked data. Tools like Testim provide all the important features needed for testing applications, along with a custom step to mask sensitive data (black it out) before taking a screenshot. This way, the tool tests the application and generates reports as usual, but the results include no sensitive data.

Applications Using Sensitive Data

While many applications deal with sensitive data, it’s not always necessary for end-users to see it. For example, e-commerce applications can let users save credit card details to prevent entering them every time they purchase something. Once they’ve saved the card details, they don’t need to see the complete details for every transaction. Just seeing part of the details, say the last 4 digits of the card, is enough to identify which card they’re paying with. That’s an instance where data masking can provide protection.

Why do we need to display part of the details and not mask everything? Imagine that a user needs to pay with a specific card and has multiple cards in their profile. If you mask all details, they wouldn’t know which card they’re paying with. But displaying the last 4 digits helps them select or identify a card. Hence, this is a common data masking practice in applications.

Now let’s get to the hands-on part and see some examples of data masking in action.

Data Masking Examples

First, here’s a simple Python example that demonstrates how data masking works. After that, we’ll see how to implement it in a database.

Here’s the Python code:

users = [{'Name':'Tony', 'Country': 'France', 'Card Number':'3542-7583-7228-3788'},
{'Name':'Steve','Country':'Austria', 'Card Number':'3881-8829-5554-4875'},
{'Name':'Peter', 'Country':'Spain', 'Card Number':'8445-5556-9621-9962'}]


print("Name\t\tCountry\t\tCard Number")
for user in users:
    print (user['Name']+"\t\t"+user['Country']+"\t\t"+user['Card Number'])

First, I create a list of dictionaries with user names, countries, and card numbers. Then I use a loop to display these details. I’m not masking any data here yet. You can use any Python IDE, CLI, or online Python interpreters to execute this. I’m using the CLI. Let’s look at the output of this code.

Name, country and card numbers

As you can see, the card number, which is sensitive information, is clearly visible. Now let’s change the code slightly to mask the card details.

users = [{'Name':'Tony', 'Country': 'France', 'Card Number':'3542-7583-7228-3788'},
{'Name':'Steve','Country':'Austria', 'Card Number':'3881-8829-5554-4875'},
{'Name':'Peter', 'Country':'Spain', 'Card Number':'8445-5556-9621-9962'}]


print("Name\t\tCountry\t\tCard Number")
for user in users:
    # Replace everything before the last 4 digits with a fixed mask of X characters
    card_number = 'XXXXXXXXXXXXXX' + user['Card Number'][-4:]
    print (user['Name']+"\t\t"+user['Country']+"\t\t"+card_number)

I’ve changed all characters except the last 4 digits of the card number to ‘X,’ which acts as a mask in this code. Let’s look at the output.

Name, country, card numbers that show only last digits
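If you want the masked value to keep the exact shape of the original, including the hyphens, the masking step can be pulled into a small helper. This is a sketch of one possible variation, and the `mask_card` name and its parameters are made up for this example.

```python
def mask_card(card_number, visible=4, mask_char="X"):
    """Mask all digits except the last `visible`, leaving separators in place."""
    total_digits = sum(ch.isdigit() for ch in card_number)
    to_hide = max(total_digits - visible, 0)
    masked = []
    for ch in card_number:
        if ch.isdigit() and to_hide > 0:
            masked.append(mask_char)  # hide this digit
            to_hide -= 1
        else:
            masked.append(ch)  # keep hyphens and the trailing visible digits
    return "".join(masked)


print(mask_card("3542-7583-7228-3788"))  # XXXX-XXXX-XXXX-3788
```

Counting digits instead of characters means the same helper works whether the number is stored with hyphens, spaces, or no separators at all.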

This shows how data masking prevents sensitive information from being completely visible. In this case, data was hard-coded in the code itself, which is never the case in practical applications. Data is fetched from the database. Now let’s see how we can use a data mask while getting data from the database.

Data Masking in Database

I’ll be using MariaDB for this example, which comes pre-installed with the Kali Linux operating system. I start the service by running the following command:

service mysql start

First, let’s create a database and add data.

mysql -u root
create database sample;
use sample;
create table users(Name varchar(10),Country varchar(10), Card_Number varchar(20));
insert into users values ('Tony','France','3542-7583-7228-3788'),('Steve','Austria','3881-8829-5554-4875'),('Peter','Spain','8445-5556-9621-9962');
select * from users;

Visible data of name, country and card numbers

You can see that a plain SELECT statement returns all data in clear text. Applications usually use a SELECT query to fetch data from a database, so how can we change it to return masked data? MySQL and MariaDB provide built-in string functions that can be combined into a mask. I’ll use a simple string-replace approach. The query is as follows:

SELECT Name, Country, CONCAT(REGEXP_REPLACE(LEFT(Card_Number, 15), '.', 'X'), RIGHT(Card_Number, 4)) AS Card_Number FROM users;

The logic here is simple. We fetch data for all columns, but for the card number, we replace each of the first fifteen characters (twelve digits and three hyphens) with ‘X’ and concatenate the last four digits to the masked string. Each stored value is 19 characters long, so 15 masked characters plus 4 visible ones keep the original length. The output looks something like this:

Name, country and card number that has been data masked

That’s how simple masking is!
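To tie the two examples together, here is a hedged sketch of an application fetching data that has already been masked by the query. It uses Python’s built-in sqlite3 module with an in-memory database instead of MariaDB, purely so that it runs without a server. The `||` operator and `substr` function are SQLite’s, and the table simply mirrors the one created above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute("CREATE TABLE users (Name TEXT, Country TEXT, Card_Number TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?, ?)",
    [
        ("Tony", "France", "3542-7583-7228-3788"),
        ("Steve", "Austria", "3881-8829-5554-4875"),
        ("Peter", "Spain", "8445-5556-9621-9962"),
    ],
)

# The mask is applied inside the query, so the application only ever
# receives the last four digits of each card number.
MASKED_QUERY = """
    SELECT Name, Country,
           'XXXX-XXXX-XXXX-' || substr(Card_Number, -4) AS Card_Number
    FROM users
    ORDER BY rowid
"""
rows = conn.execute(MASKED_QUERY).fetchall()
for name, country, card in rows:
    print(f"{name}\t{country}\t{card}")
conn.close()
```

Because the full card number is never selected, nothing downstream of the query (application code, logs, network traffic to the client) can leak it.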

Data Masking Best Practices

We’ve gone through different aspects of data masking and learned how important and easy it is. There are various approaches to data masking, and it pays to follow the most secure ones. I’ll conclude with some best practices for data masking.

  1. Find and mask all sensitive data. If you have different databases and places where you store sensitive data, find and mask all of them.
  2. Mask data at the origin. If you think masking data only on the client-side is enough, you’re wrong. Various tools can intercept traffic before it reaches the client application. If you’re masking data after it reaches the client, malicious actors can still get hold of unmasked data from the network.
  3. Use irreversible data masking techniques. The whole point of data masking is to protect sensitive data. If users can convert masked data back to original data, there’s no point in masking it. For example, masking digits with alphabetic characters at the associated positions (1->a, 2->b, 3->c, etc.) is not a secure approach because users can reverse it.
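The reversibility problem from the last point is easy to demonstrate. In the sketch below, the digit-to-letter table extends the 1->a, 2->b, 3->c mapping from the example. Mapping 0 to 'j' is an assumption added here so that every digit is covered.

```python
# Digit-to-letter substitution (0 -> 'j' is an assumption for this sketch).
FORWARD = str.maketrans("1234567890", "abcdefghij")
BACKWARD = str.maketrans("abcdefghij", "1234567890")

card = "3542-7583-7228-3788"

substituted = card.translate(FORWARD)
recovered = substituted.translate(BACKWARD)  # anyone who guesses the table can do this
print(substituted, "->", recovered)

# Overwriting with a constant discards the hidden characters entirely,
# so there is nothing left to reverse.
masked = "X" * (len(card) - 4) + card[-4:]
print(masked)
```

The substituted value looks scrambled, but it still carries all the original information. The value masked with a constant only carries the last four digits.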

In conclusion, data masking is a simple yet effective technique for data security. If your organization deals with sensitive data, data masking is a must.
