Striim Platform 5.0 documentation

Vector Embeddings

Vector embeddings are numeric representations of pieces of text, such as sentences, paragraphs, or documents, in a high-dimensional vector space, where each dimension corresponds to a learned feature or attribute of the language. Vector embeddings enable enhanced data analysis, search, and machine learning capabilities. Transforming raw data into dense, meaningful vectors allows for efficient similarity comparisons, recommendation systems, and various other advanced data processing tasks.

For example, a sample embedding of 12 dimensions looks like the following:

[
  0.002243932,
  -0.009333183,
  0.01574578,
  -0.007790351,
  -0.004711035,
  0.014844206,
  -0.009739526,
  -0.03822161,
  -0.0069014765,
  -0.028723348,
  0.02523134,
  0.01814574
]
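
Because embeddings place semantically similar text near each other in vector space, similarity between two texts reduces to a geometric comparison of their vectors. As an illustrative sketch (this helper is not a Striim function), cosine similarity between two such vectors can be computed as:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors (illustrative helper,
    not part of Striim): 1.0 for identical directions, 0.0 for orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)
```

A vector database performs this kind of comparison at scale to find the stored embeddings nearest to a query embedding.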

Using vector embeddings in your Striim application

Striim allows you to configure an embeddings model that you can use to generate vector embeddings. After the embeddings are generated, you can store them in an appropriate data store. Vector embeddings are commonly stored in vector databases, such as Pinecone, ChromaDB, and FAISS; Striim provides storage options in the form of target adapters for your embeddings.

To use vector embeddings within your application, you can create an Embedding Generator object which you can invoke within a built-in function inside a CQ. This object contains the connection and setup details for a single embeddings model you can use to generate vector embeddings, which you can write to supported Striim targets.

Because embeddings are generated through a built-in function, you can apply any SQL-based transformations to the data columns you want to create embeddings from. There are two ways to create and configure the embedding generator object:

  • Console: a section in the console where you can perform CRUD operations on the embeddings object

  • Tungsten console: use the regular CREATE OR REPLACE statement (see the TQL syntax below)

When creating the embedding generator object, you provide the name of the object, the name of the AI model (see options below), your API key, and any other necessary account details to use the embeddings API.

Creating a vector embeddings generator in the console

To create a vector embeddings generator object from the console:

  1. Select Vector Embeddings Generator from the Striim AI menu.

  2. Select the option to create a Vector Embeddings Generator object.

  3. Configure the following settings:

    • Name

    • Namespace

    • AI model provider: specify OpenAI or Vertex AI

  4. For OpenAI configure the following:

    • API key

    • Model name

    • Organization ID

  5. For Vertex AI configure the following:

    • Project

    • Model name

    • Service account key

    • Location

    • Publisher

Creating a vector embeddings generator using TQL

Sample TQL for creating an embeddings generator object:

CREATE OR REPLACE EMBEDDINGGENERATOR OpenAIEmbedder2 USING OpenAI (
  modelName: 'text-embedding-ada-002',
  apiKey: '**',
  organizationID: 'orgID' // optional
);

CREATE CQ EmbGenCQ
INSERT INTO EmbeddingCDCStream
SELECT putUserData(e, 'embedding', java.util.Arrays.toString(generateEmbeddings("admin.OpenAIEmbedder2",
  TO_STRING(GETDATA(e, "description")))))
FROM sourceCDCStream e;

CREATE OR REPLACE TARGET DeliverEmbeddingChangesToPostgresDB USING DatabaseWriter (
  Tables: 'AIDEMO.PRODUCTS,aidemo.products2
    ColumnMap(product_id=product_id,product_name=product_name,description=description,list_price=
    list_price,embedding=@USERDATA(embedding))',
  Username: 'example',
  Password: 'example',
  BatchPolicy: 'EventCount:100,Interval:2',
  CommitPolicy: 'EventCount:100,Interval:2',
  ConnectionURL: 'jdbc:postgresql://url:port/postgres?stringtype=unspecified'
 ) INPUT FROM EmbeddingCDCStream;
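
In the CQ above, java.util.Arrays.toString serializes the Float[] embedding into a bracketed, comma-separated string. A hypothetical Python equivalent (the helper name is ours) shows the format, which PostgreSQL can parse into a vector column when the ConnectionURL sets stringtype=unspecified:

```python
def arrays_to_string(values):
    # Mimics the output shape of java.util.Arrays.toString(float[]):
    # a bracketed, comma-plus-space-separated list, e.g. "[0.1, -0.2, 0.3]".
    return "[" + ", ".join(str(v) for v in values) + "]"
```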

Supported targets for writing vector embeddings

The following are the supported targets for writing vector embeddings and the recommended data types.

Target type                                | Recommended data type                                          | Target version
PostgreSQL                                 | vector(<dimension>)                                            | PG16
Snowflake                                  | Array                                                          |
BigQuery                                   | array<float64>                                                 |
MongoDB                                    | n/a                                                            |
Spanner                                    | Array                                                          |
Databricks                                 | Array                                                          |
Azure SQL database                         | varchar(max)                                                   |
Fabric Lakehouse and Fabric Data Warehouse | String                                                         |
SingleStore (MemSQL)                       | Vector (versions 8.5 and above); Varchar (versions below 8.5)  | 8.5 and above, and below 8.5
Oracle                                     | Vector                                                         | Oracle 23ai

Embeddings model type options and supported models

The following are the model types available and the connection parameters needed:

Table 1. Embeddings model type options

AI model provider | Embeddings model name (dimensions) | Organization | Required connection parameters
OpenAI            | text-embedding-ada-002 (1536)      | OpenAI       | API_Key, OrganizationID (optional)
VertexAI          | textembedding-gecko (768)          | Google       | ProjectId, Location, Provider, ServiceAccountKey



Table 2. Supported models

Model name                              | Model provider | Token limit | Dimensions
text-embedding-ada-002                  | OpenAI         | 8192        | 1536
text-embedding-3-small                  | OpenAI         | 8192        | 1536
text-embedding-3-large                  | OpenAI         | 8192        | 3072
textembedding-gecko@001                 | VertexAI       | 3092        | 768
textembedding-gecko-multilingual@001    | VertexAI       | 3092        | 768
textembedding-gecko@002                 | VertexAI       | 3092        | 768
textembedding-gecko@003                 | VertexAI       | 3092        | 768
textembedding-gecko@latest              | VertexAI       | 3092        | 768
textembedding-gecko-multilingual@latest | VertexAI       | 3092        | 768



Table 3. Default models

Model provider | Model name
OpenAI         | text-embedding-3-small
VertexAI       | textembedding-gecko@003



Using batching when generating vector embeddings

Batch processing can improve performance when generating vector embeddings. Batching involves aggregating the events and the data to be embedded, generating the embeddings for each batch and aggregating them into a list using the generateEmbeddingsPerBatch function, and then flattening the nested data structure into a single, one-dimensional list for downstream processing.

Note

Striim currently supports batching only for vector embedding models configured with OpenAI.

// Window to batch the events.
CREATE JUMPING WINDOW <window_name> OVER <source_stream>
KEEP <window_policy>;

// Aggregate the events and the data to be embedded.
CREATE OR REPLACE CQ <CQ_name>
INSERT INTO <int_stream>
SELECT list(w) as events, list(TO_STRING(<col_name>)) as data FROM <window_name> w;

// Generate embeddings and aggregate them into a list.
CREATE CQ <CQ_name>
INSERT INTO <out_stream>
SELECT makeTupleList(a.events, generateEmbeddingsPerBatch(<namespace>.<object_name>, a.data)) as objs FROM <int_stream> a;

// Unroll the list.
CREATE CQ <CQ_name>
INSERT INTO <final_stream>
SELECT putUserData(cast(origevent.get(0) as com.webaction.proc.events.WAEvent), 'embedding', java.util.Arrays.toString(cast(origevent.get(1) as java.lang.Float[]))) FROM <out_stream> e, ITERATOR (e.objs, java.util.List) as origevent;
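
The batch-then-unroll pattern in the TQL above can be sketched in ordinary code as well. In this Python sketch all names are ours, and embed_fn stands in for the provider's batch embeddings call:

```python
def generate_embeddings_per_batch(texts, embed_fn):
    # Send the whole batch to the (stand-in) embeddings call at once,
    # then pair each input with its embedding, like makeTupleList above.
    embeddings = embed_fn(texts)
    return list(zip(texts, embeddings))

def unroll(batches):
    # Flatten the nested per-batch lists into a single one-dimensional
    # list of (text, embedding) pairs, as the final unroll CQ does with ITERATOR.
    return [pair for batch in batches for pair in batch]

# Example with a fake embedder that returns a 1-dimensional vector per text:
batch = generate_embeddings_per_batch(["cat", "dog"], lambda ts: [[0.1], [0.2]])
flat = unroll([batch])
```

Batching amortizes the per-request overhead of the embeddings API across many events, which is where the performance gain comes from.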