Striim Platform 5.0 documentation

Sentinel AI: real-time sensitive data detection and protection

You can use the Sentinel AI Agent to detect and protect sensitive data flowing through your Striim applications in real time. Sentinel is available as a component in the Striim Flow Designer, and you can place it anywhere in your application. Sentinel can take two types of actions on the data that flows through it.

  1. Actions on Data Identifiers: Sentinel uses Striim AI to detect sensitive information in every event that passes through it, matches it to supported Sensitive Data Identifiers such as email addresses (EMAIL_ADDRESS), US Social Security Numbers (USA_SOCIAL_SECURITY_NUMBER), or UK National Insurance Numbers (GBR_NATIONAL_INSURANCE_NUMBER), and then takes protective actions such as masking or encrypting the sensitive information in the input event before it flows further downstream. For example, if you configure Sentinel to encrypt credit card numbers, then Sentinel will encrypt any information in the input event that matches a credit card number, irrespective of the location of that information in the input event. You can also tag the input event with information about the sensitive data that Sentinel discovered in the event, and you can store this information in downstream systems for audit and analytics.

  2. Actions on Fields: If you know the fields in the input data stream that contain sensitive data, you can specify that Sentinel takes actions such as masking or encryption on data in those fields. Here, Sentinel does not detect sensitive data in the input event, using Striim AI or other means; instead, it relies on you to specify its actions. For example, you can specify that Sentinel should mask the field named "Card Number", and Sentinel will mask any information in the input event that is stored in the field named "Card Number", irrespective of the nature of that information.

Actions on Data Identifiers:  Real-time detection and protection of sensitive data

When enabled, Sentinel uses Striim AI to detect and identify sensitive data in its input data stream, protect it as per your specifications for the matched Sensitive Data Identifiers, and tag the event with AIData. Collectively, these actions are known as Actions on Data Identifiers.

Sensitive data detection

Sentinel uses Striim AI to detect sensitive data in every input event and match it to a supported Sensitive Data Identifier. Sentinel's accuracy depends on the AI engines used, and these AI engines may occasionally misclassify information.

Sensitive data protection

After detecting sensitive information in the input event, Sentinel can protect the information by encrypting or masking it before it flows further downstream. For example, in the figure below, assume that you have configured Sentinel with the following actions in accordance with your organization’s data governance policies (under Policy Actions on the Sentinel property panel):

  • NAME:  No action

  • EMAIL_ADDRESS:  Custom masking using a regular expression that reveals only the first letter and the domain name in the email address

  • CREDIT_CARD_NUMBER:  Mask completely

  • USA_SOCIAL_SECURITY_NUMBER:  Encrypt

As shown below, Sentinel protects sensitive information such as EMAIL_ADDRESS, CREDIT_CARD_NUMBER, and USA_SOCIAL_SECURITY_NUMBER by obfuscating them in the input event before publishing it to the output. Since you have not specified protective actions, i.e. encrypt or mask, for other Sensitive Data Identifiers, Sentinel will take no action when it detects sensitive information such as NAME, PHONE_NUMBER, ADDRESS, IND_AADHAAR_NUMBER, or GBR_NATIONAL_INSURANCE_NUMBER, and it will let such information pass through as is to the output.

sentinel-concept-actions-data-identifiers-v4.png

The Actions on Data Identifiers depend solely on the type of sensitive information and not on its location in the input record. For example, in the input record in the figure above, a US Social Security Number has been mistakenly placed in the field named “Department”. Since you have configured Sentinel to act on USA_SOCIAL_SECURITY_NUMBER, Sentinel is able to identify and protect this information, thereby preventing it from flowing further downstream in its original form.

The Actions on Data Identifiers also support schema evolution. If new columns, fields, files, tables, collections or objects are added to your configured source, and thereby to your application, then Sentinel can automatically detect, identify and protect sensitive information in the newly added data, thus ensuring that your application remains governed throughout its lifecycle.

Event tagging

Sentinel can inform you about the sensitive data that it detected in the input record, and it can tag the input event by adding this information as AIData to the Striim event type (such as WAEvent or JSONNodeEvent), as shown in the figure above. AIData is conceptually similar to UserData (see: Adding user-defined data to WAEvent streams and Adding user-defined data to JSONNodeEvent streams), except that it is generated by Striim. The output stream from Sentinel contains the event AIData, and you can query the output stream to learn about the sensitive data that Striim has detected in your record. When you write the record to the target, you can also write the event AIData to the target so that you retain a record of the sensitive data that Sentinel detected; otherwise, the AIData information for that record will be lost.

Your Striim Admin must classify the supported Sensitive Data Identifiers as "High Importance", "Medium Importance", or "Low Importance", in accordance with your organization's data governance policies. In your Sentinel configuration, you can specify the importance level of the Identifiers that Sentinel must tag in the event AIData. For example, in the figure above, assume that your Striim Admin has retained the default Sensitive Data Identifier importance classification, and you have configured Sentinel as shown. Accordingly, the event AIData contains the following information in JSON format:

  1. Number of sensitive data identifiers in the input event: By default, Sentinel will inform you about the number of sensitive data types that it identified in the input event. Sentinel will scan the input event for the Sensitive Data Identifiers that you have enabled for event tagging and the Sensitive Data Identifiers that you have specified for protective actions, i.e. encrypt or mask, under Policy Actions. For example, in the figure above, Sentinel will scan the input event for information that matches a "High Importance" or "Medium Importance" Sensitive Data Identifier, in addition to EMAIL_ADDRESS, CREDIT_CARD_NUMBER, and USA_SOCIAL_SECURITY_NUMBER. As shown in the event AIData in the figure above, Sentinel has detected that the input record contains 2X EMAIL_ADDRESS, 1X PHONE_NUMBER and 1X USA_SOCIAL_SECURITY_NUMBER. Accordingly, Sentinel has informed you that it detected 3 occurrences (numOccurrences) of 2 types of sensitive data (numIdentifiers) in the input record.

  2. Location of sensitive data in the input event: Optionally, you can configure Sentinel to write the type and location of the sensitive data in the input event to AIData. Therefore, in the event AIData in the figure above, Sentinel reported the fields where it detected information that matched a "High Importance" or "Medium Importance" Sensitive Data Identifier, in accordance with your Sentinel specifications.
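
The two counters described above can be illustrated with a short Python sketch. This is not Sentinel's implementation: the `detections` list, its (identifier, field) layout, and the `summarize` helper are hypothetical, and only the names `numIdentifiers` and `numOccurrences` are taken from the description above.

```python
from collections import Counter

def summarize(detections):
    """Count distinct identifier types and total occurrences in one event.

    `detections` is a hypothetical list of (identifier, field) pairs,
    one pair per piece of sensitive data found in the event.
    """
    per_identifier = Counter(identifier for identifier, _field in detections)
    return {
        "numIdentifiers": len(per_identifier),            # distinct types
        "numOccurrences": sum(per_identifier.values()),   # total matches
    }

# Hypothetical detections for a single event.
detections = [
    ("EMAIL_ADDRESS", "Email.home"),
    ("EMAIL_ADDRESS", "Email.work"),
    ("USA_SOCIAL_SECURITY_NUMBER", "Department"),
]
summary = summarize(detections)
```

With these three hypothetical detections, the summary reports two identifier types and three occurrences.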

Typical use cases for event tagging include the following:

  1. You can write the AIData to the external target for audit purposes.

  2. You can analyze the AIData to know if the event contains sensitive data in its expected location, enabling you to investigate the upstream sources for potential data quality issues. For example, in the figure above, Sentinel has informed you that it detected an email address in the fields named “Email.home” and “Email.work”, as you likely expected, and a US Social Security Number in the field named "Department", which seems contrary to your expectations.
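
The second use case can be sketched as a simple comparison between where sensitive data was found and where you expect it. The sketch below is only an illustration under stated assumptions: the (identifier, field) pairs stand in for location information that you have already extracted from AIData, and `expected_fields`, the "SSN" field name, and `unexpected_locations` are hypothetical.

```python
def unexpected_locations(detections, expected_fields):
    """Return detections found in a field that is not an expected
    location for that identifier."""
    return [
        (identifier, field)
        for identifier, field in detections
        if field not in expected_fields.get(identifier, set())
    ]

# Hypothetical expectations about where each identifier should live.
expected_fields = {
    "EMAIL_ADDRESS": {"Email.home", "Email.work"},
    "USA_SOCIAL_SECURITY_NUMBER": {"SSN"},
}
detections = [
    ("EMAIL_ADDRESS", "Email.home"),
    ("EMAIL_ADDRESS", "Email.work"),
    ("USA_SOCIAL_SECURITY_NUMBER", "Department"),
]
suspicious = unexpected_locations(detections, expected_fields)
```

Here only the Social Security Number in "Department" is flagged for follow-up with the upstream source.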

Actions on Fields:  Real-time actions on sensitive data

Sentinel supports a second method to act on sensitive data in its input data stream. If you are certain that sensitive information is located in specific fields or columns in the source dataset such as a table, collection, directory or topic-partition, then you can use Sentinel’s Actions on Fields to take actions such as masking or encryption on data located in those fields or columns. The Actions on Fields do not support sensitive data detection or event tagging.

You can configure a single Sentinel component to act on input data in one or more columns in one or more tables, with a different action for each named field. For example, assume that you know that the "HR.Accounting" table in your source database contains columns such as "SSN" and "Bank Account No." that you want to obfuscate before the rows flow further downstream. You can use Sentinel’s Actions on Fields to take actions such as masking or encryption on all data in these named columns, as shown in the figure below.

sentinel-concept-actions-entities.png

You can use Sentinel’s Actions on Fields independently or in conjunction with Actions on Data Identifiers. If you configure both Actions on Fields and Actions on Data Identifiers, then the Actions on Fields will override the Actions on Data Identifiers for information in the same field or column. For example, assume that you want to mask all email addresses in the input dataset except for email addresses stored in two specific columns. You can achieve this by configuring the Actions on Data Identifiers to mask all email addresses, and then configuring Actions on Fields in the same Sentinel to take no action on the two named columns.
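
The override rule can be sketched in a few lines of Python. This is a simplified model rather than Sentinel's code: `resolve_action`, the column names, and the "NO_ACTION" label are hypothetical, while "MASK_COMPLETELY" is the action name used in the TQL example in this document.

```python
def resolve_action(field, identifier, field_actions, identifier_actions):
    """Pick the action for one value: a field-level action, if present,
    overrides the identifier-level action."""
    if field in field_actions:
        return field_actions[field]
    return identifier_actions.get(identifier, "NO_ACTION")

# Mask every email address, except in two named columns (hypothetical names).
identifier_actions = {"EMAIL_ADDRESS": "MASK_COMPLETELY"}
field_actions = {"SupportEmail": "NO_ACTION", "SalesEmail": "NO_ACTION"}

kept = resolve_action("SupportEmail", "EMAIL_ADDRESS", field_actions, identifier_actions)
masked = resolve_action("Notes", "EMAIL_ADDRESS", field_actions, identifier_actions)
```

In this sketch, an email address in "SupportEmail" passes through unchanged, while an email address detected in any other field is masked.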

Typical use cases for Actions on Fields include the following.

  1. You cannot use the Striim AI-powered Actions on Data Identifiers because your organization has not yet formulated its AI policies. Therefore, you are starting with the Actions on Fields for sensitive data governance.

  2. You are certain that sensitive information is placed in specific fields or columns in your source entities such as tables, collections, directories or topic-partitions, and you want to take actions on data in those specific fields or columns only.

  3. You are transitioning to Striim from another vendor that also provides sensitive data governance on named fields.  You can initially use Sentinel’s Actions on Fields to match the actions of your current vendor.  At a later time, you can switch to using Sentinel’s Actions on Data Identifiers and let Striim AI manage sensitive data governance in your applications.

  4. You can specify "No Action" for a field that is the primary key, foreign key, or part of a composite primary key at the target to ensure that data in that field, including sensitive data, is not masked or encrypted when it flows downstream. This works because the Actions on Fields override the Actions on Data Identifiers for data in the same named field.

Actions on sensitive data

Sentinel supports the following actions on sensitive information.

  1. No action:  Take no action and let the sensitive information flow downstream as is.  This action is supported for Actions on Data Identifiers and Actions on Fields.

  2. Encrypt:  Encrypt the sensitive information with your Google KMS keys.  This action is supported for Actions on Data Identifiers and Actions on Fields.

  3. Mask completely:  Replace all characters in the sensitive information in the input record, including delimiters such as spaces or hyphens, with “x” and then write to the output.   This action is supported for Actions on Data Identifiers and Actions on Fields.

  4. Custom masking:  These actions are supported only for Actions on Data Identifiers.

    1. Redact the sensitive information by displaying a user-specified number of characters at the beginning and end of the sensitive information, and masking all other characters in the middle with “x”.

    2. Use a regex pattern to mask sensitive information.

Encrypt

You can encrypt the sensitive information using your Google KMS keys. This action is similar to Encrypting data using Shield.

Mask completely

Sentinel can completely mask the input sensitive information by replacing every character, including delimiters such as spaces, hyphens or commas, with “x” as shown in the examples below:

  • 1234567812345678 → xxxxxxxxxxxxxxxx

  • 1234-5678-1234-5678 → xxxxxxxxxxxxxxxxxxx

  • 1234 5678 1234 5678 → xxxxxxxxxxxxxxxxxxx
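
The complete-masking behavior shown in these examples amounts to replacing every character, delimiters included, with "x". A minimal Python sketch of that rule follows; the function name `mask_completely` is hypothetical and this is not Sentinel's own code.

```python
def mask_completely(value: str) -> str:
    """Replace every character, including spaces and hyphens, with 'x'."""
    return "x" * len(value)

plain = mask_completely("1234567812345678")
hyphenated = mask_completely("1234-5678-1234-5678")
```

The output always has the same length as the input, so the hyphenated form yields three more "x" characters than the plain 16-digit form.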

Custom masking: Redaction

Sentinel can redact your sensitive information by masking all characters with “x” except for a fixed number of characters at the beginning and at the end of the information, as shown in the table below.  You must specify the number of characters that must be excluded from the masking.

sentinel-redaction-table1.png

If the number of characters in the input message is equal to or less than the sum of the total number of first and last characters that you want shown in the output message, then your input message will not be masked, as shown in the table below.

sentinel-redaction-table2.png
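
The redaction rule, including the case where the input is too short to mask, can be sketched in Python as follows. The function `redact`, its parameter names, and the sample values are hypothetical; they illustrate the rule described above and are not Sentinel's implementation.

```python
def redact(value: str, show_first: int, show_last: int) -> str:
    """Keep the first `show_first` and last `show_last` characters and
    mask everything in between with 'x'."""
    if show_first < 0 or show_last < 0:
        raise ValueError("character counts must be non-negative")
    # If nothing would remain to mask, the input is returned unchanged.
    if len(value) <= show_first + show_last:
        return value
    masked_len = len(value) - show_first - show_last
    tail = value[len(value) - show_last:]
    return value[:show_first] + "x" * masked_len + tail

card = redact("1234567812345678", 4, 4)   # middle eight characters masked
short = redact("1234", 2, 2)              # too short: returned as is
```

Note that a short input passes through unmasked, so choose the visible character counts with the shortest expected value in mind.
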
Custom masking: Using a Regular Expression

You can specify a regular expression (regex) pattern that Sentinel must search for and mask.  When you supply a regex pattern, Sentinel will replace each match with “x”, repeated for the length of the match.

For example, assume that your source contains US Social Security Numbers in the format 123-45-6789, and you want to mask these numbers while retaining the format.  You can provide a regex pattern such as (\\d{3})-(\\d{2}), and Sentinel’s output will be xxx-xx-6789 for sensitive information that it identifies as USA_SOCIAL_SECURITY_NUMBER  and that matches the supplied regex pattern.

sentinel-regex-table.png
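
The Social Security Number example above can be approximated with Python's `re` module. This sketch rests on an assumption drawn from that example: because the hyphen between the two groups survives in the output, the sketch masks only the text captured by groups when the pattern contains groups, and masks the whole match otherwise. The function `mask_with_regex` is hypothetical, Sentinel's regex engine may behave differently, and the pattern is written here as a Python raw string with single backslashes.

```python
import re

def mask_with_regex(value: str, pattern: str) -> str:
    """Mask matched text with 'x', preserving the length of the input."""
    regex = re.compile(pattern)
    chars = list(value)
    for match in regex.finditer(value):
        if regex.groups:
            # Assumption: only captured groups are masked.
            spans = [match.span(i) for i in range(1, regex.groups + 1)]
        else:
            spans = [match.span()]
        for start, end in spans:
            # A group that did not participate has span (-1, -1): empty range.
            for i in range(start, end):
                chars[i] = "x"
    return "".join(chars)

ssn = mask_with_regex("123-45-6789", r"(\d{3})-(\d{2})")
```

Under this assumption the first two digit groups are masked and the trailing four digits and both hyphens are retained.
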
Using Sentinel in your application
Adding a Sentinel component in Flow Designer

Sentinel is available as a component in the flow designer palette.  You can place Sentinel in your application and configure it as shown below. You must set up the AI engines to work with Striim before you can run your application with Sentinel.

  1. Provide a name for the Sentinel component.

  2. Specify the input stream that Sentinel will read from.

  3. To configure Sentinel for real-time sensitive data detection and protection using Striim AI,

    1. Under the section on “Actions on Data Identifiers”, you must enable Sentinel to detect and take actions on sensitive data in real time anywhere in the stream. If you do not enable real-time detection, then Sentinel will not detect sensitive data in the input event, implement Policy Actions on data in the input event that matches a listed Sensitive Data Identifier, tag the scanned events with AIData, or report any performance metrics.

    2. If you want Sentinel to tag the scanned events with AIData, then you must enable event tagging.

      1. If event tagging is enabled, Sentinel will tag events with the number of sensitive data types that it has detected in the input event.

      2. Optionally, you can specify the importance levels of the Sensitive Data Identifiers that you want Sentinel to tag. For example, if you select "High Importance", then Sentinel will write the location of any sensitive information that matches a Sensitive Data Identifier that your Striim Admin has classified as "High Importance".

    3. You can specify the Policy Actions that you want Sentinel to take on sensitive information that matches specific Sensitive Data Identifiers.

      1. For example, you can specify “Mask Completely” as the action for PASSPORT_NUMBER, and Sentinel will completely mask any information in the input event that it identifies as a passport number.

      2. If you do not list a Sensitive Data Identifier under Policy Actions, then Sentinel will detect it, tag it in AIData, but not take any action on it.

      3. When you save the Sentinel configuration, it will retain only those Sensitive Data Identifiers whose Policy Action is not set to "No action".

      sentinel_property_panel_actions_on_identifiers.png
  4. To configure Sentinel to take actions on data in specific fields,

    1. Under the section on “Actions on Fields”, you must specify the details of the source entity, field and the action to be taken on data in that field.

      1. You can think of an entity as a logical grouping or collection of data objects or elements that share a common structure or purpose such as a table in a relational database, or a collection in a NoSQL database, or a directory in a file system, or a folder in a cloud storage bucket, or a topic partition in Kafka.

      2. You can think of a field as equivalent to a column in a relational database table, or a field in a NoSQL database collection, or a field in data that you read from a cloud object storage or data lake using Striim's JSON parser.

    2. If you configure both Actions on Fields and Actions on Data Identifiers, then the Actions on Fields will override the Actions on Data Identifiers for information in the same field or column.

    sentinel_property_panel_actions_on_fields.png
  5. Specify the output stream that Sentinel will publish to.

Configuring Actions on Fields

The following list provides examples of entities and fields for different sources so that you know how to fill in the Actions on Fields section of the Sentinel property panel. For example, when you are reading from a relational database (RDBMS), you can think of the entity as a table in the source database, and the field as a column name in that table. Similarly, when you are reading from a NoSQL database such as MongoDB, you can think of the entity as a collection in the source database, and the field as the field name in the records in that collection.

For each source, the list gives the readers or parsers, the event type, the entity property, and the field property.

RDBMS

  • Readers/Parser: Database Reader, GG Trail Reader, HP NonStop Reader, Incremental Batch Reader, MariaDB Reader, MS Jet Reader, MS SQL Reader, MySQL Reader, Oracle Reader, OJet, PostgreSQL

  • Event type: WAEvent

  • Entity property: Tables

  • Field property: Column

MongoDB and Mongo Cosmos DB

  • Readers/Parser: MongoDB Reader, Mongo Cosmos DB Reader

  • Event type: JSONNodeEvent

  • Entity property: Collection

  • Field property: Field

CosmosDB

  • Readers/Parser: CosmosDBReader

  • Event type: JSONNodeEvent

  • Entity property: Container

  • Field property: Field

SaaS application: Salesforce

  • Readers/Parser: Salesforce Reader, Salesforce CDC Reader, Salesforce Pardot Reader

  • Event type: WAEvent

  • Entity property: sObjects

  • Field property: Field

Other SaaS applications (Hubspot, Stripe and other supported sources)

  • Readers/Parser: Google Ads Reader, Hubspot Reader, Intercom Reader, Jira Reader, ServiceNow Reader, Stripe Reader, Zendesk Reader

  • Event type: WAEvent

  • Entity property: Tables

  • Field property: Column

Data Warehouse (Snowflake, BigQuery)

  • Readers/Parser: Database Reader, Incremental Batch Reader, Snowflake CDC Reader

  • Event type: WAEvent

  • Entity property: Tables

  • Field property: Column

Kafka (JSON, XML, Avro, Parquet, and CSV events)

  • Readers/Parser: Kafka Reader with JSON Parser (JSON events), XML Parser V2 (XML events), Avro Parser (Avro events), Parquet Parser (Parquet events), or DSV Parser (CSV events)

  • Event type: JSONNodeEvent, XMLNodeEvent, AvroEvent, ParquetEvent, or WAEvent (for CSV)

  • Entity property: Topic-Partition

  • Field property: Field (JSON, XML) or Column (Avro, Parquet, CSV)

Data lake (JSON, XML, Avro, Parquet, and CSV events)

  • Readers/Parser: GCS Reader, S3 Reader, or ADLS Reader, and the appropriate parser

  • Event type: JSONNodeEvent, XMLNodeEvent, AvroEvent, ParquetEvent, or WAEvent (for CSV events)

  • Entity property: Directory

  • Field property: Field (JSON and XML) or Column (Avro, Parquet, CSV)

File (JSON, XML, Avro, Parquet, and CSV events)

  • Readers/Parser: File Reader, HDFS Reader, or MultiFile Reader, and the appropriate parser

  • Event type: JSONNodeEvent, XMLNodeEvent, AvroEvent, ParquetEvent, or WAEvent (for CSV events)

  • Entity property: Directory

  • Field property: Field (JSON, XML) or Column (Avro, Parquet, CSV)

Using a Sentinel component in TQL

The following is an example of a Sentinel component that detects and takes specific actions on identifiers.

CREATE OR REPLACE SENTINEL mySentinel (
  DETECTION ON
  IDENTIFIER ACTIONS '{"AGE": {"action" : "MASK_COMPLETELY"}, 
  "EMAIL_ADDRESS": {"action" : "MASK_M_N", "m" : "2", "n" : "1"}}',
  entity Actions '{"AIDEMO.CUSTOMER": {"COL5" : {"action" : "MASK_COMPLETELY"}}}'
  TAGGING LEVELS 'HIGH,MEDIUM'
)
INPUT FROM sourceStream
OUTPUT TO PIIStream;

CREATE CQ putSentinelDataInUserData
INSERT INTO FinalStream
SELECT
DATA(e) data,
putUserData(e,'TaggedInfo', getAIData(e, 'Sentinel'))
FROM PIIStream e;
Viewing Sentinel results

You can go to the Sensitive Data Governance icon on the Flow Designer top bar to view the Sentinel results. You can view results for the last hour or for the last 24 hours. Sentinel does not store results beyond the last 24 hours. Additionally, you can download the Sentinel report in PDF or CSV format for further analysis.

Results include:

  • Events tagged as sensitive: The number of events tagged by Sentinel with information about the number and types of sensitive data detected in the scanned event and, optionally, the location of specific Sensitive Data Identifiers in the scanned event.

  • Sensitive Data Identifiers: The number of Sensitive Data Identifiers that Sentinel detected in the input stream in the specified time period. In the screenshot, Sentinel is reporting that it detected names, phone numbers, vehicle identification numbers, email addresses and 17 other sensitive data identifiers in the last hour.

  • Occurrences of sensitive data: The total number of occurrences, or total count, of sensitive data of all supported types in the specified time period, broken down by the configured Policy Actions. In the screenshot, Sentinel is reporting that it detected 3.08K occurrences of 21 types of sensitive data in the last hour, of which 814 were names, 533 were phone numbers, and so on. Additionally, Sentinel performed some type of masking action on 1.51K sensitive data occurrences, in accordance with the user's Policy Actions.

sentinel-dashboard-flowdesigner-1-right-panel.png
sentinel-dashboard-flowdesigner-2-right-panel.png
sentinel-dashboard-flowdesigner-3-right-panel.png

Note

Striim will only report metrics from Sentinel's Actions on Data Identifiers.

Runtime considerations
  1. When you enable Sentinel’s Actions on Data Identifiers, you may see an increase in your end-to-end latency and your overall resource consumption (CPU and memory usage in the Striim server) because Striim AI will scan every event for sensitive information. Striim recommends provisioning adequate resources (at least one additional cluster node) to ensure that the performance of your Striim applications is not adversely impacted.

  2. Do not mask or encrypt a field that can be used as a primary key, foreign key, or part of a composite primary key at the target. If you use sensitive data as a key, you can ensure that it is not obfuscated when it flows downstream by specifying "No Action" in the Actions on Fields for any field that is used as a key. You can then continue to configure the Actions on Data Identifiers to mask or encrypt the sensitive data in accordance with your business policies.

  3. The encrypted or masked message in the output stream is a string, regardless of the data type of the information in the input stream. You must modify the configuration at the target so that it can handle the string field; otherwise, your application will halt.

  4. If a field is set to a fixed length, then encrypting the data in that field is likely to cause errors because the encryption will cause the length of the data in that field to increase.

  5. You can optimize the performance of Sentinel’s Actions on Data Identifiers by limiting Striim AI-based sensitive data detection to Sensitive Data Identifiers that are relevant to your business. To do this: (i) ask your Striim Admin to classify relevant Sensitive Data Identifiers as “High Importance” or “Medium Importance” based on your organization’s data governance policies, and to classify the Identifiers that are not relevant as “Low Importance”, and (ii) do not enable “Low Importance” Identifiers for event tagging in Sentinel. Sentinel will then scan the input event only for the Sensitive Data Identifiers that you have enabled for event tagging and the Identifiers that you have specified for protective actions, i.e. encrypt or mask, under Policy Actions. In this scenario, Sentinel will not scan for “Low Importance” Identifiers or tag them in AIData, thus improving performance. If you need to protect specific “Low Importance” Identifiers, you can list them under Sentinel’s Policy Actions, and Sentinel will scan the input records for those Identifiers, but won’t tag them in the event AIData.
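
The scanning rule in the last item can be modeled as a set union. The Python sketch below is a simplified illustration, not Sentinel's implementation: the importance classification shown is hypothetical (it is not Striim's default), and `identifiers_to_scan` and the "NO_ACTION" label are invented for this example. The level names "HIGH", "MEDIUM" and the action "MASK_COMPLETELY" follow the TQL example in this document.

```python
def identifiers_to_scan(importance, tagging_levels, policy_actions):
    """Return (scanned, tagged) identifier sets.

    scanned = identifiers enabled for tagging, plus identifiers that
              have a protective Policy Action
    tagged  = identifiers whose importance level is enabled for tagging
    """
    tagged = {i for i, level in importance.items() if level in tagging_levels}
    protected = {i for i, action in policy_actions.items() if action != "NO_ACTION"}
    return tagged | protected, tagged

# Hypothetical classification set by a Striim Admin.
importance = {
    "EMAIL_ADDRESS": "HIGH",
    "USA_SOCIAL_SECURITY_NUMBER": "HIGH",
    "PHONE_NUMBER": "LOW",
    "AGE": "LOW",
}
scanned, tagged = identifiers_to_scan(
    importance,
    tagging_levels={"HIGH", "MEDIUM"},
    policy_actions={"AGE": "MASK_COMPLETELY"},
)
```

In this example, AGE is scanned and masked because it has a Policy Action, but it is not tagged; PHONE_NUMBER is neither scanned nor tagged.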

Limitations
  1. The accuracy of sensitive data detection and identification in Sentinel’s Actions on Data Identifiers depends on the AI engines used. AI features are not always accurate or error-free, and you acknowledge and agree that Striim AI (including Sentinel AI) may not properly detect, classify or encrypt, mask or otherwise protect all sensitive and other targeted information.

  2. Sentinel AI does not currently support the detection of sensitive information in free-form text where only a certain part or parts of the textual content are sensitive.

  3. Sentinel does not support binary data sources such as image, movie, audio, PDF or application files, or binary data types such as BLOB (Binary Large Object).

  4. Sentinel partially supports Actions on Data Identifiers on data that you read from SaaS applications, namely Atlassian Jira, Google Ads, Hubspot, Intercom, Salesforce, ServiceNow, Stripe and Zendesk. Sentinel can detect sensitive information in the incoming messages, but it cannot perform Policy Actions.