6.4.7.3 Using MySQL Enterprise Data Masking and De-Identification

Before using MySQL Enterprise Data Masking and De-Identification, install it according to the instructions provided at Section 6.4.7.2, “Installing or Uninstalling MySQL Enterprise Data Masking and De-Identification”.

To use MySQL Enterprise Data Masking and De-Identification in applications, invoke the functions that are appropriate for the operations you wish to perform. For detailed function descriptions, see Section 6.4.7.4, “MySQL Enterprise Data Masking and De-Identification User-Defined Function Reference”. This section demonstrates how to use the functions to carry out some representative tasks. It first presents an overview of the available functions, followed by some examples of how the functions might be used in real-world context:

Masking Data to Remove Identifying Characteristics

MySQL provides general-purpose masking functions that mask arbitrary strings, and special-purpose masking functions that mask specific types of values.

General-Purpose Masking Functions

mask_inner() and mask_outer() are general-purpose functions that mask parts of arbitrary strings based on position within the string:

  • mask_inner() masks the interior of its string argument, leaving the ends unmasked. Other arguments specify the sizes of the unmasked ends.

    mysql> SELECT mask_inner('This is a string', 5, 1);
    +--------------------------------------+
    | mask_inner('This is a string', 5, 1) |
    +--------------------------------------+
    | This XXXXXXXXXXg                     |
    +--------------------------------------+
    mysql> SELECT mask_inner('This is a string', 1, 5);
    +--------------------------------------+
    | mask_inner('This is a string', 1, 5) |
    +--------------------------------------+
    | TXXXXXXXXXXtring                     |
    +--------------------------------------+
  • mask_outer() does the reverse, masking the ends of its string argument, leaving the interior unmasked. Other arguments specify the sizes of the masked ends.

    mysql> SELECT mask_outer('This is a string', 5, 1);
    +--------------------------------------+
    | mask_outer('This is a string', 5, 1) |
    +--------------------------------------+
    | XXXXXis a strinX                     |
    +--------------------------------------+
    mysql> SELECT mask_outer('This is a string', 1, 5);
    +--------------------------------------+
    | mask_outer('This is a string', 1, 5) |
    +--------------------------------------+
    | Xhis is a sXXXXX                     |
    +--------------------------------------+

By default, mask_inner() and mask_outer() use 'X' as the masking character, but permit an optional masking-character argument:

mysql> SELECT mask_inner('This is a string', 5, 1, '*');
+-------------------------------------------+
| mask_inner('This is a string', 5, 1, '*') |
+-------------------------------------------+
| This **********g                          |
+-------------------------------------------+
mysql> SELECT mask_outer('This is a string', 5, 1, '#');
+-------------------------------------------+
| mask_outer('This is a string', 5, 1, '#') |
+-------------------------------------------+
| #####is a strin#                          |
+-------------------------------------------+
Special-Purpose Masking Functions

Other masking functions expect a string argument representing a specific type of value and mask it to remove identifying characteristics.

Note

The examples here supply function arguments using the random value generation functions that return the appropriate type of value. For more information about generation functions, see Generating Random Data with Specific Characteristics.

Payment card Primary Account Number masking.  Masking functions provide strict and relaxed masking of Primary Account Numbers.

  • mask_pan() masks all but the last four digits of the number:

    mysql> SELECT mask_pan(gen_rnd_pan());
    +-------------------------+
    | mask_pan(gen_rnd_pan()) |
    +-------------------------+
    | XXXXXXXXXXXX2461        |
    +-------------------------+
  • mask_pan_relaxed() is similar but does not mask the first six digits that indicate the payment card issuer unmasked:

    mysql> SELECT mask_pan_relaxed(gen_rnd_pan());
    +---------------------------------+
    | mask_pan_relaxed(gen_rnd_pan()) |
    +---------------------------------+
    | 770630XXXXXX0807                |
    +---------------------------------+

US Social Security number masking.  mask_ssn() masks all but the last four digits of the number:

mysql> SELECT mask_ssn(gen_rnd_ssn());
+-------------------------+
| mask_ssn(gen_rnd_ssn()) |
+-------------------------+
| XXX-XX-1723             |
+-------------------------+
Generating Random Data with Specific Characteristics

Several functions generate random values. These values can be used for testing, simulation, and so forth.

gen_range() returns a random integer selected from a given range:

mysql> SELECT gen_range(1, 10);
+------------------+
| gen_range(1, 10) |
+------------------+
|                6 |
+------------------+

gen_rnd_email() returns a random email address in the example.com domain:

mysql> SELECT gen_rnd_email();
+---------------------------+
| gen_rnd_email()           |
+---------------------------+
| [email protected] |
+---------------------------+

gen_rnd_pan() returns a random payment card Primary Account Number:

mysql> SELECT gen_rnd_pan();

(The gen_rnd_pan() function result is not shown because its return values should be used only for testing purposes, and not for publication. It cannot be guaranteed the number is not assigned to a legitimate payment account.)

gen_rnd_ssn() returns a random US Social Security number with the first and second parts each chosen from a range not used for legitimate numbers:

mysql> SELECT gen_rnd_ssn();
+---------------+
| gen_rnd_ssn() |
+---------------+
| 912-45-1615   |
+---------------+

gen_rnd_us_phone() returns a random US phone number in the 555 area code not used for legitimate numbers:

mysql> SELECT gen_rnd_us_phone();
+--------------------+
| gen_rnd_us_phone() |
+--------------------+
| 1-555-747-5627     |
+--------------------+
Generating Random Data Using Dictionaries

MySQL Enterprise Data Masking and De-Identification enables dictionaries to be used as sources of random values. To use a dictionary, it must first be loaded from a file and given a name. Each loaded dictionary becomes part of the dictionary registry. Items then can be selected from registered dictionaries and used as random values or as replacements for other values.

A valid dictionary file has these characteristics:

  • The file contents are plain text, one term per line.

  • Empty lines are ignored.

  • The file must contain at least one term.

Suppose that a file named de_cities.txt contains these city names in Germany:

Berlin
Munich
Bremen

Also suppose that a file named us_cities.txt contains these city names in the United States:

Chicago
Houston
Phoenix
El Paso
Detroit

Assume that the secure_file_priv system variable is set to /usr/local/mysql/mysql-files. In that case, copy the dictionary files to that directory so that the MySQL server can access them. Then use gen_dictionary_load() to load the dictionaries into the dictionary registry and assign them names:

mysql> SELECT gen_dictionary_load('/usr/local/mysql/mysql-files/de_cities.txt', 'DE_Cities');
+--------------------------------------------------------------------------------+
| gen_dictionary_load('/usr/local/mysql/mysql-files/de_cities.txt', 'DE_Cities') |
+--------------------------------------------------------------------------------+
| Dictionary load success                                                        |
+--------------------------------------------------------------------------------+
mysql> SELECT gen_dictionary_load('/usr/local/mysql/mysql-files/us_cities.txt', 'US_Cities');
+--------------------------------------------------------------------------------+
| gen_dictionary_load('/usr/local/mysql/mysql-files/us_cities.txt', 'US_Cities') |
+--------------------------------------------------------------------------------+
| Dictionary load success                                                        |
+--------------------------------------------------------------------------------+

To select a random term from a dictionary, use gen_dictionary():

mysql> SELECT gen_dictionary('DE_Cities');
+-----------------------------+
| gen_dictionary('DE_Cities') |
+-----------------------------+
| Berlin                      |
+-----------------------------+
mysql> SELECT gen_dictionary('US_Cities');
+-----------------------------+
| gen_dictionary('US_Cities') |
+-----------------------------+
| Phoenix                     |
+-----------------------------+

To select a random term from multiple dictionaries, randomly select one of the dictionaries, then select a term from it:

mysql> SELECT gen_dictionary(ELT(gen_range(1,2), 'DE_Cities', 'US_Cities'));
+---------------------------------------------------------------+
| gen_dictionary(ELT(gen_range(1,2), 'DE_Cities', 'US_Cities')) |
+---------------------------------------------------------------+
| Detroit                                                       |
+---------------------------------------------------------------+
mysql> SELECT gen_dictionary(ELT(gen_range(1,2), 'DE_Cities', 'US_Cities'));
+---------------------------------------------------------------+
| gen_dictionary(ELT(gen_range(1,2), 'DE_Cities', 'US_Cities')) |
+---------------------------------------------------------------+
| Bremen                                                        |
+---------------------------------------------------------------+

The gen_blacklist() function enables a term from one dictionary to be replaced by a term from another dictionary, which effects masking by substitution. Its arguments are the term to replace, the dictionary in which the term appears, and the dictionary from which to choose a replacement. For example, to substitute a US city for a German city, or vice versa, use gen_blacklist() like this:

mysql> SELECT gen_blacklist('Munich', 'DE_Cities', 'US_Cities');
+---------------------------------------------------+
| gen_blacklist('Munich', 'DE_Cities', 'US_Cities') |
+---------------------------------------------------+
| Houston                                           |
+---------------------------------------------------+
mysql> SELECT gen_blacklist('El Paso', 'US_Cities', 'DE_Cities');
+----------------------------------------------------+
| gen_blacklist('El Paso', 'US_Cities', 'DE_Cities') |
+----------------------------------------------------+
| Bremen                                             |
+----------------------------------------------------+

If the term to replace is not in the first dictionary, gen_blacklist() returns it unchanged:

mysql> SELECT gen_blacklist('Moscow', 'DE_Cities', 'US_Cities');
+---------------------------------------------------+
| gen_blacklist('Moscow', 'DE_Cities', 'US_Cities') |
+---------------------------------------------------+
| Moscow                                            |
+---------------------------------------------------+
Using Masked Data for Customer Identification

At customer-service call centers, one common identity verification technique is to ask customers to provide their last four Social Security number (SSN) digits. For example, a customer might say her name is Joanna Bond and that her last four SSN digits are 0007.

Suppose that a customer table containing customer records has these columns:

  • id: Customer ID number.

  • first_name: Customer first name.

  • last_name: Customer last name.

  • ssn: Customer Social Security number.

For example, the table might be defined as follows:

CREATE TABLE customer
(
  id         BIGINT NOT NULL AUTO_INCREMENT PRIMARY KEY,
  first_name VARCHAR(40),
  last_name  VARCHAR(40),
  ssn        VARCHAR(11)
);

The application used by customer-service representatives to check the customer SSN might execute a query like this:

mysql> SELECT id, ssn
mysql> FROM customer
mysql> WHERE first_name = 'Joanna' AND last_name = 'Bond';
+-----+-------------+
| id  | ssn         |
+-----+-------------+
| 786 | 906-39-0007 |
+-----+-------------+

However, that exposes the SSN to the customer-service representative, who has no need to see anything but the last four digits. Instead, the application can use this query to display only the masked SSN:

mysql> SELECT id, mask_ssn(CONVERT(ssn USING binary)) AS masked_ssn
mysql> FROM customer
mysql> WHERE first_name = 'Joanna' AND last_name = 'Bond';
+-----+-------------+
| id  | masked_ssn  |
+-----+-------------+
| 786 | XXX-XX-0007 |
+-----+-------------+

Now the representative sees only what is necessary, and customer privacy is preserved.

Why was the CONVERT() function used for the argument to mask_ssn()? Because mask_ssn() requires an argument of length 11. Thus, even though ssn is defined as VARCHAR(11), if the ssn column has a multibyte character set, it may appear to be longer than 11 bytes when passed to a UDF, and an error occurs. Converting the value to a binary string ensures that the UDF sees an argument of length 11.

A similar technique may be needed for other data masking functions when string arguments do not have a single-byte character set.

Creating Views that Display Masked Data

If masked data from a table is used for multiple queries, it may be convenient to define a view that produces masked data. That way, applications can select from the view without performing masking in individual queries.

For example, a masking view on the customer table from the previous section can be defined like this:

CREATE VIEW masked_customer AS
SELECT id, first_name, last_name,
mask_ssn(CONVERT(ssn USING binary)) AS masked_ssn
FROM customer;

Then the query to look up a customer becomes simpler but still returns masked data:

mysql> SELECT id, masked_ssn
mysql> FROM masked_customer
mysql> WHERE first_name = 'Joanna' AND last_name = 'Bond';
+-----+-------------+
| id  | masked_ssn  |
+-----+-------------+
| 786 | XXX-XX-0007 |
+-----+-------------+