Sentinel AI: real-time sensitive data detection and protection

You can use the Sentinel AI Agent to detect and protect sensitive data flowing in your Striim applications. Sentinel is available as a component in the Striim Flow Designer, and you can place it anywhere in your application. Sentinel scans every event that passes through it for sensitive data and takes actions on the sensitive data based on your specifications, all in real time.

Actions on Data Identifiers: Real-time detection and protection of sensitive data

When enabled, Sentinel uses Striim AI to detect sensitive data in its input data stream. Striim AI uses a combination of classification algorithms, pattern matching, and analysis of the data and the metadata, including the field names, to identify sensitive information.

Sentinel scans the input record for information that matches supported Sensitive Data Identifiers such as email addresses (EMAIL_ADDRESS), US Social Security Numbers (USA_SOCIAL_SECURITY_NUMBER), UK National Insurance Numbers (GBR_NATIONAL_INSURANCE_NUMBER), or the India Unique Identification Number (IND_AADHAAR_NUMBER). The accuracy of detection can depend on the AI engines that you use for Sentinel, and the AI engines may sometimes misclassify information. After Sentinel has identified and matched sensitive information to a supported Sensitive Data Identifier, it implements actions such as partial masking or encryption on the information that you have specified for the matching Sensitive Data Identifier.

These real-time actions on sensitive data (detection, identification, protection and event tagging, described later) are driven by Striim AI and are collectively known as Actions on Data Identifiers because they are based solely on the source information matching a Sensitive Data Identifier, irrespective of the information’s location in the input record.

For example, in the figure below, we highlight the sensitive information that Sentinel has identified in the input record. Assume that you have configured Sentinel with the following actions in accordance with your organization’s data governance policies:

NAME, PHONE_NUMBER: No action
EMAIL_ADDRESS: Mask the complete email address
USA_SOCIAL_SECURITY_NUMBER: Redact a US Social Security Number by displaying the last 4 digits and masking the remaining digits

In other words, you want Sentinel to take no action on sensitive information that it classifies as NAME or PHONE_NUMBER and to let such information pass through to the output as is, while you want Sentinel to protect information that it classifies as EMAIL_ADDRESS or USA_SOCIAL_SECURITY_NUMBER and then write it to the output.

Sentinel executes your Actions on Data Identifiers as shown in the output record in the figure above. Sentinel can detect and identify the sensitive information in your records, subject to the accuracy of the AI engines that you use for Sentinel, and then implement your specified actions on the identified sensitive information, all in real time. You do not have to know about the sensitive information in your records or where it is located; Sentinel can detect and manage the sensitive data in the input data stream according to your specifications. The Actions on Data Identifiers support schema evolution. If new columns, fields, files, tables, collections or objects are added to your configured source, and thereby to your application, then Sentinel can automatically detect, identify and protect sensitive information in the newly added data, thus ensuring that your application remains governed throughout its lifecycle.

These Actions on Data Identifiers depend solely on the type of sensitive information and not on its location in the input record. For example, in the input record in the figure below, a US Social Security Number has been mistakenly placed in the field named “Department”. Since you have configured Sentinel to act on USA_SOCIAL_SECURITY_NUMBER, Sentinel is able to identify and partially mask this information, thereby preventing this sensitive information from flowing further downstream in its raw form.
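The location-independent behavior described above can be illustrated with a minimal Python sketch. It is not Striim's implementation: the two regular expressions are simple stand-ins for Striim AI detection, the names DETECTORS, ACTIONS and apply_identifier_actions are hypothetical, and the record contents are made up. The sketch also assumes that partial masking replaces delimiters such as hyphens.

```python
import re

# Hypothetical stand-ins for Striim AI detection: one simple regex per identifier.
DETECTORS = {
    "EMAIL_ADDRESS": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "USA_SOCIAL_SECURITY_NUMBER": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def mask_completely(text: str) -> str:
    """Replace every character with 'x'."""
    return "x" * len(text)

def show_last_four(text: str) -> str:
    """Keep the last 4 characters and mask the rest (assumes hyphens are masked too)."""
    return "x" * (len(text) - 4) + text[-4:]

# Actions keyed by identifier type, never by field name.
ACTIONS = {
    "EMAIL_ADDRESS": mask_completely,
    "USA_SOCIAL_SECURITY_NUMBER": show_last_four,
}

def apply_identifier_actions(record: dict) -> dict:
    """Scan every field and protect whatever matches a configured identifier."""
    protected = {}
    for field, value in record.items():
        text = str(value)
        for identifier, detector in DETECTORS.items():
            action = ACTIONS[identifier]
            text = detector.sub(lambda m, act=action: act(m.group(0)), text)
        protected[field] = text
    return protected

record = {
    "Name": "Jane Doe",              # no action configured, passes through
    "Email": "jane@example.com",     # masked completely
    "Department": "123-45-6789",     # misplaced SSN is still caught
}
result = apply_identifier_actions(record)
```

Because the lookup is keyed by identifier type rather than by field name, the value in "Department" is protected even though no rule mentions that field.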

Additionally, Sentinel can inform you about the numbers and types of sensitive information that it detected in the input record, and it can tag this event by adding this information as AIData to the Striim event type (such as WAEvent or JSONNodeEvent). AIData is conceptually similar to UserData (see: Adding user-defined data to WAEvent streams and Adding user-defined data to JSONNodeEvent streams), except that it is generated by Striim. The output stream from Sentinel contains the event AIData, and you can query the output stream to learn about the sensitive data that Striim has detected in your record. When you write the record to the target, you can also write the event AIData to the target so that you know which sensitive data Sentinel detected in that record; otherwise, the AIData information for that record is lost.

The figure above shows an example of the event AIData generated by Sentinel for the input record. The input record contains 1X NAME, 2X EMAIL_ADDRESS, 1X PHONE_NUMBER and 1X USA_SOCIAL_SECURITY_NUMBER, and Sentinel has informed you that it detected 5 occurrences (Occurrences) of 4 types of sensitive data (Identifiers) in the input record. If you choose to tag your events with AIData, then Sentinel will always write the number and types of sensitive data that it detected in the input record.

You can configure Sentinel to write the type and location of the sensitive data in the event to AIData. In the figure above, Sentinel has informed you that it detected an email address in the fields named “Email.home” and “Email.work”, as you likely expected. You can query the event AIData in the output stream from Sentinel, or in the data written to the target, to check whether the input record contains the expected number and types of sensitive data in the expected locations. For example, in the figure above, Sentinel informed you that it detected a US Social Security Number in the field named "Department", which is contrary to your expectations. By querying the AIData, you can quickly learn that you may have an upstream data quality issue and which records are impacted, and you can take appropriate steps to address the problem.
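The tagging and querying described above can be sketched in Python as follows. This is only an illustration: the dictionary layout, the "Counts" and "Locations" keys, the build_ai_data function and the "Name" and "Phone" field names are hypothetical and do not reflect the actual AIData format; only the terms Occurrences and Identifiers come from the text.

```python
from collections import Counter

# Hypothetical detections for one record, as (field name, identifier) pairs.
detections = [
    ("Name", "NAME"),
    ("Email.home", "EMAIL_ADDRESS"),
    ("Email.work", "EMAIL_ADDRESS"),
    ("Phone", "PHONE_NUMBER"),
    ("Department", "USA_SOCIAL_SECURITY_NUMBER"),
]

def build_ai_data(detections):
    """Summarize detections into a tag: totals, per-type counts and locations."""
    counts = Counter(identifier for _, identifier in detections)
    locations = {}
    for field, identifier in detections:
        locations.setdefault(identifier, []).append(field)
    return {
        "Occurrences": sum(counts.values()),  # total detections in the record
        "Identifiers": len(counts),           # number of distinct types
        "Counts": dict(counts),
        "Locations": locations,
    }

ai_data = build_ai_data(detections)

# A simple data quality check: flag SSNs found outside the expected field.
expected_ssn_fields = {"SSN"}
unexpected = [
    field
    for field in ai_data["Locations"].get("USA_SOCIAL_SECURITY_NUMBER", [])
    if field not in expected_ssn_fields
]
```

Here `unexpected` contains "Department", which is the kind of signal that points to an upstream data quality issue.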

You can also choose the Sensitive Data Identifiers that Sentinel must tag in the event AIData. First, your Striim Admin must classify the supported Sensitive Data Identifiers as "Low Importance", "Medium Importance" or "High Importance", in accordance with your organization’s data governance policies. Next, in your Sentinel configuration, you can specify the importance level of the Identifiers that Sentinel must tag in the event AIData. For example, if you specify that all "High Importance" and "Medium Importance" Identifiers must be tagged, then Sentinel will write the location of any information that matches a Sensitive Data Identifier that your Striim Admin has classified as "High Importance" or "Medium Importance".
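A minimal sketch of this importance-based filtering, assuming a hypothetical classification table and a hypothetical locations_to_tag helper (neither is part of Striim; the classification values shown are examples only):

```python
# Hypothetical classification set by the Striim Admin.
IMPORTANCE = {
    "USA_SOCIAL_SECURITY_NUMBER": "High Importance",
    "EMAIL_ADDRESS": "Medium Importance",
    "NAME": "Low Importance",
}

def locations_to_tag(locations: dict, selected_levels: set) -> dict:
    """Keep location details only for identifiers at the selected importance levels."""
    return {
        identifier: fields
        for identifier, fields in locations.items()
        if IMPORTANCE.get(identifier) in selected_levels
    }

detected = {
    "NAME": ["Name"],
    "EMAIL_ADDRESS": ["Email.home", "Email.work"],
    "USA_SOCIAL_SECURITY_NUMBER": ["Department"],
}
tagged = locations_to_tag(detected, {"High Importance", "Medium Importance"})
```

With "High Importance" and "Medium Importance" selected, the location of the NAME value is left out of the tag.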

Actions on Fields: Real-time actions on sensitive data

Sentinel supports another method to act on sensitive data. If you are certain that your sensitive information is located in specific fields or columns in your source dataset such as a table, collection, directory or topic-partition, then you can use Sentinel’s Actions on Fields to take actions such as masking or encryption on the data located in those fields or columns.

For example, assume that you know that the EMP tables in your source database contain columns such as "SSN" and "Bank Account No." that you want to obfuscate before the data in these columns flows further downstream. You can use Sentinel’s Actions on Fields to take actions such as masking or encryption on all data in the named columns such as "SSN" and "Bank Account No." in a specific table and take no action on data in other columns, as shown in the figure below.

You can configure a single Sentinel component to act on data in multiple columns in multiple tables, with a different action for each named field. As the name suggests, the Actions on Fields are designed solely to act on data in one or more specified fields. These Actions do not use Striim AI to detect sensitive data in every event in the input data stream; therefore, they offer superior performance compared to Actions on Data Identifiers. Since the Actions on Fields do not know if the input stream contains sensitive data, they do not support tagging events with AIData. They are also unable to protect you in case sensitive information is erroneously placed in the source dataset. For example, in the above figure, a US Social Security Number that was erroneously placed in the "Email" column in the source table will flow downstream if your Actions on Fields did not protect the data in the "Email" column. Additionally, if you add a new column or table to the configured source dataset and, thereby, to the Striim application, any sensitive data in the newly added column or table will flow downstream as is until you manually update the Actions on Fields to cover the newly added column or table.

You can use Sentinel’s Actions on Fields independently or in conjunction with Actions on Data Identifiers. If you configure both Actions on Fields and Actions on Data Identifiers, then the Actions on Fields will override the Actions on Data Identifiers for information in the same field or column. For example, assume that you are using Striim to read from 1,000 tables in your source database, and your organization requires you to protect all credit card numbers before they flow further downstream. Assume that you need to share 10 tables with an external vendor and you have chosen to completely mask the credit card numbers in those 10 tables so that the vendor can never know the original credit card numbers. You can easily achieve this using Sentinel by configuring the Actions on Data Identifiers to encrypt all credit card numbers, and then configuring the Actions on Fields to completely mask the fields in the 10 tables that contain credit card numbers.
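The override rule in this example can be sketched as a small lookup in Python. The table name, column name, the CREDIT_CARD_NUMBER identifier name and the resolve_action function are hypothetical and serve only to illustrate the precedence; they are not Striim configuration syntax.

```python
# Hypothetical configuration: encrypt credit card numbers everywhere ...
IDENTIFIER_ACTIONS = {"CREDIT_CARD_NUMBER": "Encrypt"}

# ... but completely mask them in the tables shared with the vendor.
FIELD_ACTIONS = {("VENDOR_ORDERS", "CARD_NO"): "Mask completely"}

def resolve_action(table, field, detected_identifier):
    """Return the action for a value; a field-level action takes precedence."""
    field_action = FIELD_ACTIONS.get((table, field))
    if field_action is not None:
        return field_action
    if detected_identifier is not None:
        return IDENTIFIER_ACTIONS.get(detected_identifier, "No action")
    return "No action"

shared = resolve_action("VENDOR_ORDERS", "CARD_NO", "CREDIT_CARD_NUMBER")
internal = resolve_action("INTERNAL_ORDERS", "CARD_NO", "CREDIT_CARD_NUMBER")
```

In this sketch, `shared` resolves to "Mask completely" and `internal` resolves to "Encrypt". The same precedence explains why a field-level "No Action" on a key column keeps that column unmodified.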

Typical use cases for Actions on Fields include the following.

You cannot use the Striim AI-powered Actions on Data Identifiers because your organization has not yet formulated its AI policies. Therefore, you are starting with the Actions on Fields for sensitive data governance.
You are certain that sensitive information is placed in specific fields or columns in your source entities such as tables, collections, directories or topic-partitions, and you want to take actions on data in those specific fields, columns or other locations only. You are also confident that sensitive information has not been erroneously placed in other locations in your source entities.
You are transitioning to Striim from another vendor that also provides sensitive data governance on named fields. You can initially use Sentinel’s Actions on Fields to match the actions of your current vendor. At a later time, you can switch to using Sentinel’s Actions on Data Identifiers and let Striim AI manage sensitive data governance in your applications.
You want to specify "No Action" for a field that is the primary key, foreign key, or part of a composite primary key at the target, so that data in that field, including sensitive data, is not masked or encrypted when it flows downstream. This works because the Actions on Fields override the Actions on Data Identifiers for data in the same named field.

Actions on sensitive data

Sentinel supports the following actions on sensitive information.

No action: Take no action and let the sensitive information flow downstream as is. This action is supported for Actions on Data Identifiers and Actions on Fields.
Encrypt: Encrypt the sensitive information with your Google KMS keys. This action is supported for Actions on Data Identifiers and Actions on Fields.
Mask completely: Replace all characters in the sensitive information in the input record, including delimiters such as spaces or hyphens, with “x” and then write to the output. This action is supported for Actions on Data Identifiers and Actions on Fields.
Custom masking: These actions are supported only for Actions on Data Identifiers.
1. Redact the sensitive information by displaying a user-specified number of characters at beginning and end of the sensitive information, and masking all other characters in the middle with “x”.
2. Use a regex pattern to mask sensitive information.
The encrypted or masked value in the output stream is a string, regardless of the data type of the information in the input stream. You must modify the configuration at the target so that it can handle the string field; otherwise, your application will halt.

Encrypt

You can encrypt the sensitive information using your Google KMS keys. This action is similar to Encrypting data using Shield.

Mask completely

Sentinel can completely mask the input sensitive information by replacing every character, including delimiters such as spaces, hyphens or commas, with “x” as shown in the examples below:

1234567812345678 → xxxxxxxxxxxxxxxx
1234-5678-1234-5678 → xxxxxxxxxxxxxxxxxxx
1234 5678 1234 5678 → xxxxxxxxxxxxxxxxxxx
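The examples above correspond to the following minimal Python sketch. The function name mask_completely is hypothetical; the sketch also shows that the result is a string even when the input is a number.

```python
def mask_completely(value) -> str:
    """Replace every character, including delimiters, with 'x'."""
    return "x" * len(str(value))

masked_plain = mask_completely("1234567812345678")
masked_hyphens = mask_completely("1234-5678-1234-5678")
masked_spaces = mask_completely("1234 5678 1234 5678")
masked_number = mask_completely(1234567812345678)  # numeric input, string output
```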

Custom masking: Redaction

Sentinel can redact your sensitive information by masking all characters with “x” except for a fixed number of characters at the beginning and at the end of the information, as shown in the table below. You must specify the number of characters that must be excluded from the masking.

If the number of characters in the input message is equal to or less than the total number of first and last characters that you want shown in the output message, then your input message will not be masked, as shown in the table below.
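The redaction rule, including the short-input case, can be sketched in Python as follows. The function and parameter names are hypothetical, and the sketch assumes that delimiters in the middle of the value are masked like any other character.

```python
def redact(value: str, keep_first: int, keep_last: int) -> str:
    """Show the first and last characters as requested; mask the middle with 'x'."""
    if len(value) <= keep_first + keep_last:
        # Nothing is left to mask, so the value is returned unchanged.
        return value
    middle = len(value) - keep_first - keep_last
    return value[:keep_first] + "x" * middle + value[len(value) - keep_last:]

card = redact("1234567812345678", 4, 4)
ssn = redact("123-45-6789", 0, 4)
short = redact("12345678", 4, 4)  # length equals 4 + 4, so it is not masked
```

The third call shows why the number of visible characters should be chosen with the shortest expected value in mind: an input that is too short passes through unmasked.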

Custom masking: Using regex

You can specify a regular expression (regex) pattern that Sentinel must search for and mask. When you supply a regex pattern, Sentinel will replace each match with “x”, repeated for the length of the match.

For example, assume that your source contains US Social Security Numbers in the format 123-45-6789, and you want to mask these numbers while retaining the format. You can provide a regex pattern such as (\\d{3})-(\\d{2}), and Sentinel’s output will be xxx-xx-6789 for sensitive information that it identifies as USA_SOCIAL_SECURITY_NUMBER and that matches the supplied regex pattern.
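The example output xxx-xx-6789 keeps the hyphen between the two captured groups, which suggests that the capture groups are masked while the rest of the match is preserved. The Python sketch below follows that reading; it is an inference from the example and not confirmed Sentinel behavior. The function name mask_with_regex is hypothetical, and the pattern uses single backslashes because it is written as a Python raw string.

```python
import re

def mask_with_regex(value: str, pattern: str) -> str:
    """Mask capture groups with 'x'; if the pattern has no groups, mask the whole match."""
    regex = re.compile(pattern)

    def _mask(match: re.Match) -> str:
        if regex.groups == 0:
            return "x" * len(match.group(0))
        chars = list(match.group(0))
        offset = match.start()
        for index in range(1, regex.groups + 1):
            start, end = match.span(index)
            if start == -1:  # this group did not take part in the match
                continue
            for pos in range(start - offset, end - offset):
                chars[pos] = "x"
        return "".join(chars)

    return regex.sub(_mask, value)

grouped = mask_with_regex("123-45-6789", r"(\d{3})-(\d{2})")
ungrouped = mask_with_regex("123-45-6789", r"\d{3}-\d{2}")
```

Under these assumptions, `grouped` is "xxx-xx-6789", while the pattern without groups masks the hyphen as well and gives "xxxxxx-6789".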

Using Sentinel in your application

Sentinel is available as a component in the flow designer palette. You can place Sentinel in your application and configure it as shown below. You must set up the AI engines to work with Striim before you can run your application with Sentinel.

Provide a name for the Sentinel component.
Specify the input stream that Sentinel will read from.
To enable Sentinel for real-time sensitive data detection and protection using Striim AI,
1. Under the section on “Actions on Data Identifiers”, you must enable Sentinel to detect and take actions on sensitive data in real time anywhere in the stream. If you do not enable real-time detection, then Sentinel will not scan the input events for sensitive information, identify the sensitive information as a supported Sensitive Data Identifier, implement the Actions on Data Identifiers, tag the scanned events with AIData, or report any performance metrics.
2. If you want Sentinel to tag the scanned events with AIData, then you must enable event tagging.
  1. If event tagging is enabled, Sentinel will tag events with the number of sensitive data types that it has detected in the input event.
  2. Optionally, you can specify the importance levels of the Sensitive Data Identifiers that you want Sentinel to tag. For example, if you select "High Importance", then Sentinel will write the location of any sensitive information that matches a Sensitive Data Identifier that your Striim Admin has classified as "High Importance".
3. You can specify the Policy Actions that you want Sentinel to take on sensitive information that matches specific Sensitive Data Identifiers.
  1. For example, you can specify “Mask Completely” as the action for PASSPORT_NUMBER, and Sentinel will completely mask any information in the input event that it identifies as a passport number, irrespective of the location of that information in the input event.
  2. If you do not list a Sensitive Data Identifier under Policy Actions, then Sentinel will detect it and tag it in AIData, but will not take any action on it.
To enable Sentinel to take actions on data in specific fields,
1. Under the section on “Actions on Fields”, you must specify the details of the source entity, field and the action to be taken on data in that field.
  1. You can think of an entity as a logical grouping or collection of data objects or elements that share a common structure or purpose such as a table in a relational database, or a collection in a NoSQL database, or a directory in a file system, or a folder in a cloud storage bucket, or a topic partition in Kafka.
  2. You can think of a field in the context as a field in a NoSQL database collection or a column in a relational database table.
2. If you configure both Actions on Fields and Actions on Data Identifiers, then the Actions on Fields will override the Actions on Data Identifiers for information in the same field or column.
Specify the output stream that Sentinel will publish to.

Viewing Sentinel results

You can go to the Sensitive Data Governance icon on the flow designer top bar to view the Sentinel results. You can view results for the last hour or for the last 24 hours. Sentinel does not store results beyond the last 24 hours.

Results include:

Events tagged as sensitive: The number of events tagged by Sentinel with information about the number and types of sensitive data detected in the scanned event and, optionally, the location of specific Sensitive Data Identifiers in the scanned event.
Sensitive Data Identifiers: The number of Sensitive Data Identifiers that Sentinel detected in the input stream in the specified time period. In this screenshot, Sentinel is reporting that it detected names, phone numbers, vehicle identification numbers, email addresses and 17 other sensitive data identifiers in the last 1 hour.
Occurrences of sensitive data: The total number of occurrences, or total count, of sensitive data of all supported types in the specified time period, broken down by the configured Policy Actions. In the screenshot, Sentinel is reporting that it detected 3.08K occurrences of 21 types of sensitive data in the last 1 hour, of which 814 were names, 533 were phone numbers, and so on. Additionally, Sentinel performed some type of masking action on 1.51K sensitive data occurrences, in accordance with the user's Policy Actions.

Note

Striim will only report metrics from Sentinel's Actions on Data Identifiers.

Runtime considerations

When you enable Sentinel’s Actions on Data Identifiers, you may see an increase in your end-to-end latency and your overall resource consumption (CPU and memory usage in the Striim server) because Striim AI will scan every event for sensitive information. Striim recommends provisioning adequate resources - at least one additional cluster node - to ensure that the performance of your Striim applications is not adversely impacted.
Do not mask or encrypt a field that can be used as a primary key, foreign key, or part of a composite primary key at the target. If you use sensitive data as a key, you can ensure that it is not obfuscated when it flows downstream by specifying "No Action" for any field that is used as a key.

Limitations

The accuracy of sensitive data detection and identification in Sentinel’s Actions on Data Identifiers depends on the AI engines used, and the AI engines may sometimes misclassify information.
Sentinel AI does not currently support the detection of sensitive information in free-form text where only a certain part or parts of the textual content are sensitive.
Sentinel does not support binary data sources such as image, movie, audio, PDF or application files, or binary data types such as BLOB (Binary Large Object).
Sentinel does not support Actions on Data Identifiers on data that you read from SaaS applications, namely Atlassian Jira, Google Ads, HubSpot, Intercom, Salesforce, ServiceNow, Stripe and Zendesk.

Striim Platform 5.0 documentation