
Test Data Is Critical: How to Best Generate, Manage, and Use It

By Testim

When it comes to testing, data is king! In an increasingly digitized world, data is becoming more important for many businesses. However, many people aren’t aware that data plays an important role in testing. Why? Because the best way to test your application is with real data.

Of course, you should avoid using production data directly. Instead, use test data: data that comes as close as possible to your production data without revealing any sensitive information.

Why do the accuracy and structure of test data matter? Well, it doesn’t make sense to test your software with completely meaningless data. This might sound harsh, but it’s probably better not to test at all than to test with such data, as you might be testing nothing. Meaningless data doesn’t add any value to the quality of your application. Therefore, test data needs to be meaningful enough to support worthwhile tests, yet it must not reveal private information.

In addition, with the rise of automated testing, there’s little room for manual test data creation. Continuous testing has gained a lot of attention within the DevOps and software development communities. This part of the testing strategy involves generating test data on the fly while running test cases. Basically, your testing strategy relies on a script that can generate the ideal testing data for you and your projects.
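A minimal sketch of such a script, assuming a hypothetical user record with `username`, `email`, and `age` fields (none of these names come from a real application). Seeding the random generator keeps each run reproducible, which makes a failing test easier to replay.

```python
import random
import string


def make_user(rng: random.Random) -> dict:
    """Build one synthetic user record (all field names are hypothetical)."""
    name = "".join(rng.choices(string.ascii_lowercase, k=8))
    return {
        "username": name,
        "email": f"{name}@example.com",  # example.com is reserved for documentation
        "age": rng.randint(18, 90),
    }


def make_users(count: int, seed: int = 42) -> list[dict]:
    """Generate `count` users; the same seed always yields the same data."""
    rng = random.Random(seed)
    return [make_user(rng) for _ in range(count)]
```

A test case could call `make_users(20)` in its setup step instead of relying on a hand-written fixture file.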

This article explains how to safely generate test data, how to manage it, and, finally, how to use it.

Generating High-Quality Test Data

To perform quality testing, you’ll need data of high quality. The goal of test data generation is to generate meaningful, connected, and interrelated data.

According to the State of DevOps 2019 report by Redgate, 65% of companies copy their production data to be used in testing. It’s quite worrying that only 36% of these companies apply masking techniques to protect the information they’re using from hackers. Even so, basing your test data on your production data is not a bad approach in itself.

Production data represents an accurate image of the data that works well with your application. This means that this data is well suited for running test cases. Of course, it’s important to mask or substitute any sensitive data to avoid disclosing any personally identifiable information.

Furthermore, other approaches to generate test data include:

  • Generate data directly in a database.
  • Prepare CSV or JSON files that contain data to be used by scripts or test cases.
  • Generate data manually by interacting with the front end and exploring advanced paths.
  • Scrape the web to extract real data for testing your application. However, make sure to scrape only the data that you’re allowed to.
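As a rough illustration of the file-based approach, the sketch below serializes a list of records to CSV and JSON text. The function names and the product fields in the usage note are hypothetical; a real project would write the result to whatever location its test cases read from.

```python
import csv
import io
import json


def to_csv(records: list[dict]) -> str:
    """Serialize records to CSV text, using the first record's keys as the header."""
    if not records:
        raise ValueError("need at least one record to derive the CSV header")
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(records[0].keys()))
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


def to_json(records: list[dict]) -> str:
    """Serialize records to indented JSON text."""
    return json.dumps(records, indent=2)
```

For example, `to_csv([{"sku": "A-100", "price_cents": 1999}])` produces a header row followed by one data row. Keep in mind that CSV stores every value as text, so a test that reads the file back has to convert numbers itself.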

The above four strategies require more effort than continuous testing does, though. Ideally, you want test data generation to be part of your test automation strategy. But what’s the best approach to generating test data?

Generating Test Data Using Different Techniques

It’s essential to have a mix of different strategies to generate test data. If you are only using a single strategy to generate test data, you might end up testing the same cases over and over.

Consequently, using different methodologies to generate your test data will give you a rich set of testing data. For instance, you can combine data that you created manually by interacting with your front end with data generated by an automated tool. You want to make sure the test data covers all possible test scenarios, including the so-called negative scenarios. A negative scenario verifies failure paths in your application to see how the application handles invalid actions or input.
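To make the idea of a negative scenario concrete, here is a small sketch. It assumes a hypothetical `parse_age` validator that accepts whole-number ages from 18 to 120; the validator and both case lists are invented for illustration.

```python
def parse_age(raw: str) -> int:
    """Hypothetical validator: accept whole-number ages from 18 to 120."""
    value = int(raw)  # int() raises ValueError for non-numeric text
    if not 18 <= value <= 120:
        raise ValueError(f"age out of range: {value}")
    return value


# Positive cases exercise the success path, including both boundaries.
POSITIVE_CASES = ["18", "42", "120"]
# Negative cases exercise the failure paths: out of range, not a number, empty.
NEGATIVE_CASES = ["17", "121", "-5", "abc", "", "4.2"]


def classify(raw: str) -> str:
    """Report whether the validator accepts or rejects a raw input."""
    try:
        parse_age(raw)
    except ValueError:
        return "rejected"
    return "accepted"
```

Keeping the positive and negative inputs in separate lists makes it obvious at a glance when one of the two groups is thin.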

If you are generating test data, the following three properties will make sure you generate high-quality test data each time:

  1. Accurate data: Data should be realistic and resemble real-life situations. You don’t want to fill in a date that lies 100 years in the future.
  2. Valid data: Data should match the purpose of your tool. If you have a webshop, don’t test for crazy scenarios where someone would buy 200 items. Make sure to simulate valid scenarios where a user buys one to ten products at once.
  3. Exceptions data: Make sure that your data also covers exceptions. For instance, a user returned a product to a webshop and received a coupon code for their next purchase. Make sure to cover the scenario where a user checks out using only a coupon code. It’s a clear exception scenario that deserves testing.
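The three properties can be sketched in a single generator. This example assumes a hypothetical webshop order with invented field names: order dates never lie in the future (accurate), quantities stay between one and ten (valid), and every fifth order is paid entirely with a coupon (exception).

```python
import random
from datetime import date, timedelta


def make_order(rng: random.Random, today: date, coupon_only: bool = False) -> dict:
    """Build one synthetic order (field names are hypothetical)."""
    quantity = rng.randint(1, 10)  # valid: one to ten products
    order_date = today - timedelta(days=rng.randint(0, 365))  # accurate: never in the future
    total_cents = quantity * rng.randint(100, 5000)
    coupon_cents = total_cents if coupon_only else 0  # exception: coupon covers everything
    return {
        "quantity": quantity,
        "order_date": order_date,
        "total_cents": total_cents,
        "coupon_cents": coupon_cents,
        "amount_due_cents": total_cents - coupon_cents,
    }


def make_orders(count: int, today: date, seed: int = 7, exception_every: int = 5) -> list[dict]:
    """Generate orders; every `exception_every`-th order is a coupon-only checkout."""
    rng = random.Random(seed)
    return [
        make_order(rng, today, coupon_only=(i % exception_every == 0))
        for i in range(count)
    ]
```

Passing `today` in as a parameter, rather than calling `date.today()` inside the generator, keeps the output identical from one day to the next.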

Next, let’s explore four common pitfalls you may encounter when generating test data that’s based on your production data.

4 Common Pitfalls When Using Production Data as Test Data

Using your production data might be a smart approach for your organization to generate test data. However, many organizations forget about the limitations of this data. Here are four common pitfalls companies encounter when basing their test data on production data.

Pitfall 1: Missing Data

When the development team creates new functionality, this might introduce new data that’s being captured. This means that you have new tables in your database for which you don’t have any sample data. When you’re blindly copying production data to be used as test data, you might forget about these new data tables.

Therefore, analyze if any new data has been introduced that your testing engineers need to generate.
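One way to spot such gaps is to list the tables that contain no rows at all. The sketch below assumes the test database is SQLite and uses Python’s built-in `sqlite3` module; other database engines expose their table catalog differently, so treat the query as an assumption rather than a general recipe.

```python
import sqlite3


def find_empty_tables(conn: sqlite3.Connection) -> list[str]:
    """Return the names of user tables that contain no rows."""
    tables = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master "
            "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' "
            "ORDER BY name"
        )
    ]
    empty = []
    for table in tables:
        # Table names come from the database catalog, not from user input.
        count = conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]
        if count == 0:
            empty.append(table)
    return empty
```

Running this check after each import of production data gives your testing engineers a short list of tables that still need generated data.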

Pitfall 2: Production Data That Follows the Happy Path

For testing engineers, the “happy path” is a common term that refers to testing only the success scenarios. This happy path is also easy to find in production data, as every action that a user completes should be successful. Considering this, your production data might not be the ideal data set to use for testing. You’ll likely have to create data for negative scenarios as well, so you can test failures.

Pitfall 3: Testing Edge Cases

Your production data often doesn’t represent any edge cases. Because the production data represents the happy path, you won’t find many edge cases or advanced flows in your data. This might be an issue if you want to test all possible scenarios to reach 100% test coverage. Your production data might test only 70% to 80% of the scenarios.

In short, you won’t reach 100% test coverage solely with production data. Your application requires additional data to represent advanced flows. Furthermore, generating this data might require a more manual approach.

Pitfall 4: Generating Data Without a Testing Purpose

Testing engineers often forget to determine the purpose of the data they are generating. What are you trying to test with the data you’ve generated? There are several clear purposes you can adopt when generating test data:

  • Data for white box testing: The test data should cover as many code paths as possible for your application, even negative paths. Therefore, you want to pass invalid parameters or invalid combinations to see how your application responds.
  • Data for security testing: There’s a big difference between data for testing all code paths and testing security issues. Data for security testing is often much more sophisticated to uncover security issues. For instance, you want to verify if only authorized people can access your system. It’s much different from generating data that verifies if the login form works as expected.
  • Data for black box testing: Black box testing focuses on verifying the application’s behavior without knowing anything about the application or code itself. Therefore, you want to generate a wide variety of data to test as many problems and cases as possible to find bugs or issues. For example, you want to generate different birth date formats to verify how a form reacts when you pass formats it doesn’t expect.
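For the birth date example, a small helper can render one date in several formats and pair those with a few malformed strings. The format selection and the helper name are assumptions made for illustration; which formats a form should accept depends on your application.

```python
from datetime import date


def birth_date_variants(d: date) -> dict[str, str]:
    """Render one date in several common numeric formats."""
    return {
        "iso": d.isoformat(),                # 1990-05-17
        "us_slash": d.strftime("%m/%d/%Y"),  # 05/17/1990
        "eu_dot": d.strftime("%d.%m.%Y"),    # 17.05.1990
        "compact": d.strftime("%Y%m%d"),     # 19900517
    }


# Strings that are not valid dates at all, useful for negative scenarios.
MALFORMED_BIRTH_DATES = ["1990-02-30", "17/13/1990", "not a date", ""]
```

Feeding each variant and each malformed string into the form shows which inputs it accepts, which it rejects, and whether it rejects them gracefully.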

Test Data Management: How Can You Do It?

Test data management includes many aspects, such as removing personally identifiable information and performing data validity checks. Here are four approaches you should follow to manage your test data properly. Each approach is equally important.

Approach 1: Remove Any Personally Identifiable Information

First, check if your data contains any personally identifiable information (PII). If so, apply data masking techniques such as substitution, shuffling, or blurring. These techniques help you to make data non-identifiable.
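Here is a simplified sketch of the three techniques, assuming hypothetical `name`, `postcode`, and `birth_date` columns. Blurring is interpreted here as shifting each date by a random number of days, which is one possible reading of the term. Real masking tools do considerably more, so treat this as an illustration only.

```python
import random
from datetime import date, timedelta


def substitute(rows: list[dict], column: str, prefix: str) -> list[dict]:
    """Substitution: replace each value with a generated placeholder."""
    return [{**row, column: f"{prefix}-{i:04d}"} for i, row in enumerate(rows, start=1)]


def shuffle(rows: list[dict], column: str, rng: random.Random) -> list[dict]:
    """Shuffling: redistribute the column's existing values across the rows."""
    values = [row[column] for row in rows]
    rng.shuffle(values)  # note: a value can still land on its original row
    return [{**row, column: value} for row, value in zip(rows, values)]


def blur_dates(rows: list[dict], column: str, rng: random.Random, max_days: int = 30) -> list[dict]:
    """Blurring: shift each date by a random offset of up to `max_days` days."""
    return [
        {**row, column: row[column] + timedelta(days=rng.randint(-max_days, max_days))}
        for row in rows
    ]


def mask(rows: list[dict], seed: int = 0) -> list[dict]:
    """Apply all three techniques and return new rows; the input is left untouched."""
    rng = random.Random(seed)
    masked = substitute(rows, "name", "customer")
    masked = shuffle(masked, "postcode", rng)
    return blur_dates(masked, "birth_date", rng)
```

Each helper returns new dictionaries instead of editing the originals, so the unmasked rows never change by accident while the masking step runs.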

Next, check the validity of your test data regularly.

Approach 2: Perform a Data Validity Check

As development moves forward and you or your team members add new features, your data should move forward too. Therefore, perform test data audits regularly to find outdated data. Furthermore, check whether any data is missing to support new functionality. As mentioned earlier in this article, new features often introduce new data tables that need data of their own.
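A simple audit can be sketched as a check for records that lack a field the current version of the application requires. The `audit` function and the field names in the usage note are hypothetical.

```python
def audit(records: list[dict], required_fields: set[str]) -> dict[int, list[str]]:
    """Map each problematic record's index to the required fields it lacks.

    A field counts as missing when it is absent, None, or an empty string.
    """
    problems = {}
    for index, record in enumerate(records):
        missing = sorted(
            field for field in required_fields if record.get(field) in (None, "")
        )
        if missing:
            problems[index] = missing
    return problems
```

If a release adds a `loyalty_tier` field, for instance, calling `audit(records, {"id", "email", "loyalty_tier"})` flags every older record that predates the feature.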

Approach 3: Refresh Your Test Data Regularly

Besides checking the validity of your data, it’s important to regularly refresh your data. This process can be easily automated with scripts that help you generate new data.

Refreshing your test data can improve the quality of your application. Different data might expose bugs that your team hasn’t discovered yet with previous test data. Therefore, it’s important to make the time to regularly update your test data.

Approach 4: Manage Data Access

Last of all, manage data access. Your organization needs to know how to access all important data. In addition, to ensure smooth testing, make sure your testing engineers always have access to the required data. You don’t want to slow down a release because of data inaccessibility.

Tip: Consider creating a list of data sources you need for testing and where they are located. This helps the testing engineers to easily find test data.

How Can You Use Test Data?

First of all, always make a copy of your test data before using it. That way, if something goes wrong, you can still access the original test dataset.
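A minimal sketch of that habit, assuming the dataset lives in a single file. The `working_copy` helper and the directory layout are hypothetical.

```python
import shutil
import tempfile
from pathlib import Path


def working_copy(original: Path, workdir: Path) -> Path:
    """Copy the original dataset into `workdir` and return the copy's path."""
    workdir.mkdir(parents=True, exist_ok=True)
    copy_path = workdir / original.name
    shutil.copy2(original, copy_path)  # copy2 also preserves file metadata
    return copy_path
```

Tests then read and modify only the file returned by `working_copy`, while the original stays untouched.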

Next, you can use test data in various ways. Scripts can convert test data into different formats or insert the data into a database. For example, you may want to directly inject the data into a test database in order to test whether the application runs correctly.
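As an illustration, the sketch below loads records into a test database using Python’s built-in `sqlite3` module. The `products` table and its columns are invented for this example, and an in-memory SQLite database stands in for whatever test database your project actually uses.

```python
import sqlite3


def load_products(conn: sqlite3.Connection, products: list[dict]) -> int:
    """Insert product records and return the resulting row count."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS products ("
        "sku TEXT PRIMARY KEY, name TEXT NOT NULL, price_cents INTEGER NOT NULL)"
    )
    # Named placeholders let sqlite3 bind the values safely from each dict.
    conn.executemany(
        "INSERT INTO products (sku, name, price_cents) "
        "VALUES (:sku, :name, :price_cents)",
        products,
    )
    conn.commit()
    return conn.execute("SELECT COUNT(*) FROM products").fetchone()[0]
```

Connecting with `sqlite3.connect(":memory:")` gives each test run a fresh database that disappears when the connection closes.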

After you run the test cases, you can do several things with your test data:

  • Store the final state of your database as a reference.
  • Delete all test data to avoid confusion about which is the original test data file. Also, clean up the imported or output test files in your application. Cleaning everything up thoroughly can be tricky. For example, output files may be hidden in several places in your tests, so it’s easy to miss a couple of them.
  • Use the end state of your database as input for further testing.
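One way to make the cleanup step harder to get wrong is to have every test write its output into a single scratch directory that is removed automatically. This is a sketch with a hypothetical `run_with_scratch_dir` helper; it only works if the code under test lets you choose where output files go.

```python
import tempfile
from pathlib import Path


def run_with_scratch_dir(test_body):
    """Run `test_body(scratch_path)` and delete everything it wrote afterwards.

    Returns the test body's result and whether the scratch directory still exists.
    """
    with tempfile.TemporaryDirectory() as scratch:
        scratch_path = Path(scratch)
        result = test_body(scratch_path)
    # Leaving the `with` block removes the directory and all files inside it.
    return result, scratch_path.exists()
```

Because every output file lives under one directory, there are no stray files left to hunt down after the run.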


As you can see, test data management includes many elements, from data generation to data management and usage. It can be challenging to get all data aspects right. The most important question to answer is whether you should use production data to generate test data. Using production data as test data saves your organization a lot of time. However, this approach does not offer full test coverage and mostly covers the happy path.

This post was written by Michiel Mulders. Michiel is a passionate blockchain developer who loves writing technical content. Besides that, he loves learning about marketing, UX psychology, and entrepreneurship. When he’s not writing, he’s probably enjoying a Belgian beer!
