![]() ![]() ![]() This is because we generate each column independently of the others. First names do not match the expected Gender.Poor data qualityīelow are a few issues with our approach in the previous section that led to the poor quality of the test data. Second, the quality of the data we generated is poor for several reasons. That in itself is somewhat expected but certainly a negative. For the remainder, we had to write our own logic. First, Faker was only able to help us with a little more than half of our columns. There are several drawbacks to what we just did. Or we can generate an arbitrary number of rows by throwing this call into a loop:Īt this point, you could write this data to a CSV and import it into the database of your choosing. We can easily generate a new row of data by calling the following: Note that as of this writing there is a known issue in Windows 10 that causes exceptions to occasionally be raised when using the Faker date_time provider.īelow, you can find a python snippet that contains a mapping from each column name to a python lambda function which will generate the columns’ value. As a quick pass, let’s say we’d like to use the following faker providers on each column:įor the columns that don’t have an applicable provider, we’ll handle them ourselves leveraging python’s own random library. Normally, we’d do an analysis of the table to determine which columns are PII and which are not, but in this case, I’d like to be able to generate arbitrary amounts of data for this schema so I’ll need to create a generating function for each column. Once installed, we can start masking individual columns. Refer to the Faker documentation for more details on how to install Faker, but in short you can run: As you can see, the table contains a variety of sensitive data including names, SSNs, birthdates, and salary information. Our ‘production’ data has the following schema. Our goal will be to generate a new dataset, our synthetic dataset, that looks and feels just like the original data. The data we will use is a table of employees at a fictitious company. We obviously won’t use real data in this article we’ll use data that is already fake but we will pretend it is real. This article, however, will focus entirely on the Python flavor of Faker. It is also available in a variety of other languages such as perl, ruby, and C#. What is Fakerįaker is a python package that generates fake data. To accomplish this, we’ll use Faker, a popular python library for creating fake data. In this article we’ll look at a variety of ways to populate your dev/staging environments with high quality synthetic data that is similar to your production data. Restricting access to high quality data with which to build and test leads to a variety of issues, including making it more difficult to find bugs. New regulations around data privacy and an increasing awareness of the importance of protecting sensitive data is pushing companies to lock down access to their production data. ![]()
0 Comments
Leave a Reply. |