Striim Platform 5.0 documentation

Vector Embeddings

Vector embeddings are numeric representations of pieces of text, such as sentences, paragraphs, or documents, in a high-dimensional vector space, where each dimension corresponds to a learned feature or attribute of the language. Vector embeddings enable enhanced data analysis, search, and machine learning capabilities. Transforming raw data into dense, meaningful vectors allows for efficient similarity comparisons, recommendation systems, and various other advanced data processing tasks.

For example, a sample embedding of 12 dimensions looks like the following:

[
  0.002243932,
  -0.009333183,
  0.01574578,
  -0.007790351,
  -0.004711035,
  0.014844206,
  -0.009739526,
  -0.03822161,
  -0.0069014765,
  -0.028723348,
  0.02523134,
  0.01814574
]
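
Because embeddings place semantically similar text near each other in vector space, similarity between two texts reduces to a geometric comparison of their vectors. As an illustrative sketch (this helper is not a Striim function), cosine similarity between two such vectors can be computed as:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors (illustrative helper,
    not part of Striim): 1.0 for identical directions, 0.0 for orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)
```

A vector database performs this kind of comparison at scale to find the stored embeddings nearest to a query embedding.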

Using vector embeddings in your Striim application

Striim allows you to configure an embeddings model that you can use to generate vector embeddings. After the embeddings are generated, you can store them in an appropriate data store. Vector embeddings are commonly stored in vector databases, such as Pinecone, ChromaDB, and FAISS; Striim provides storage options in the form of target adapters for your embeddings.

To use vector embeddings within your application, you can create an Embedding Generator object which you can invoke within a built-in function inside a CQ. This object contains the connection and setup details for a single embeddings model you can use to generate vector embeddings, which you can write to supported Striim targets.

Because embeddings are generated through a built-in function, you can apply any SQL-based transformations to the data columns you want to create embeddings from. There are two ways to create and configure the embedding generator object:

  • Console: a section in the console where you can perform CRUD operations on the embeddings object

  • Tungsten console: use the regular CREATE OR REPLACE statement (see the TQL syntax below)

When creating the embedding generator object, you provide the name of the object, the name of the AI model (see options below), your API key, and any other necessary account details to use the embeddings API.

Creating a vector embeddings generator in the console

To create a vector embeddings generator object from the console:

  1. Select Vector Embeddings Generator from the Striim AI menu.

  2. Select the option to create a Vector Embeddings Generator object.

  3. Configure the following settings:

    • Name

    • Namespace

    • AI model provider: specify OpenAI or Vertex AI

  4. For OpenAI configure the following:

    • API key

    • Model name

    • Organization ID

  5. For Vertex AI configure the following:

    • Project

    • Model name

    • Service account key

    • Location

    • Publisher

Creating a vector embeddings generator using TQL

Sample TQL for creating an embeddings generator object:

CREATE OR REPLACE EMBEDDINGGENERATOR OpenAIEmbedder2 USING OpenAI (
  modelName: 'text-embedding-ada-002',
  apiKey: '**',
  organizationID: 'orgID' // optional
);

CREATE CQ EmbGenCQ
INSERT INTO EmbeddingCDCStream
SELECT putUserData(e, 'embedding', java.util.Arrays.toString(generateEmbeddings("admin.OpenAIEmbedder2",
  TO_STRING(GETDATA(e, "description")))))
FROM sourceCDCStream e;

CREATE OR REPLACE TARGET DeliverEmbeddingChangesToPostgresDB USING DatabaseWriter (
  Tables: 'AIDEMO.PRODUCTS,aidemo.products2
    ColumnMap(product_id=product_id,product_name=product_name,description=description,list_price=
    list_price,embedding=@USERDATA(embedding))',
  Username: 'example',
  Password: 'example',
  BatchPolicy: 'EventCount:100,Interval:2',
  CommitPolicy: 'EventCount:100,Interval:2',
  ConnectionURL: 'jdbc:postgresql://url:port/postgres?stringtype=unspecified'
 ) INPUT FROM EmbeddingCDCStream;
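
In the CQ above, java.util.Arrays.toString serializes the Float[] embedding into a bracketed, comma-separated string. A hypothetical Python equivalent (the helper name is ours) shows the format, which PostgreSQL can parse into a vector column when the ConnectionURL sets stringtype=unspecified:

```python
def arrays_to_string(values):
    # Mimics the output shape of java.util.Arrays.toString(float[]):
    # a bracketed, comma-plus-space-separated list, e.g. "[0.1, -0.2, 0.3]".
    return "[" + ", ".join(str(v) for v in values) + "]"
```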

Supported targets for writing vector embeddings

The following are the supported targets for writing vector embeddings and the recommended data types.

Target type                                | Recommended data type                                          | Target version
PostgreSQL                                 | vector(<dimension>)                                            | PG16
Snowflake                                  | Array                                                          |
BigQuery                                   | array<float64>                                                 |
MongoDB                                    | n/a                                                            |
Spanner                                    | Array                                                          |
Databricks                                 | Array                                                          |
Azure SQL database                         | varchar(max)                                                   |
Fabric Lakehouse and Fabric Data Warehouse | String                                                         |
SingleStore (MemSQL)                       | Vector (versions 8.5 and above); Varchar (versions below 8.5)  | 8.5 and above, and below 8.5
Oracle                                     | Vector                                                         | Oracle 23ai

Embeddings model type options and supported models

The following are the model types available and the connection parameters needed:

Table 1. Embeddings model type options

AI model provider | Embeddings model name (dimensions) | Organization | Required connection parameters
OpenAI            | text-embedding-ada-002 (1536)      | OpenAI       | API_Key, OrganizationID (optional)
VertexAI          | textembedding-gecko (768)          | Google       | ProjectId, Location, Provider, ServiceAccountKey



Table 2. Supported models

Model name                              | Model provider | Token limit | Dimensions
text-embedding-ada-002                  | OpenAI         | 8192        | 1536
text-embedding-3-small                  | OpenAI         | 8192        | 1536
text-embedding-3-large                  | OpenAI         | 8192        | 3072
textembedding-gecko@001                 | VertexAI       | 3092        | 768
textembedding-gecko-multilingual@001    | VertexAI       | 3092        | 768
textembedding-gecko@002                 | VertexAI       | 3092        | 768
textembedding-gecko@003                 | VertexAI       | 3092        | 768
textembedding-gecko@latest              | VertexAI       | 3092        | 768
textembedding-gecko-multilingual@latest | VertexAI       | 3092        | 768



Table 3. Default models

Model provider | Model name
OpenAI         | text-embedding-3-small
VertexAI       | textembedding-gecko@003



Using batching when generating vector embeddings

Batch processing can improve performance when generating vector embeddings. Batching involves aggregating the events and the data to be embedded, generating the embeddings for each batch and aggregating them into a list using the generateEmbeddingsPerBatch function, and then flattening the nested data structure into a single, one-dimensional list for downstream processing.

Note

Striim currently supports batching only for vector embedding models configured with OpenAI.

// Window to batch the events.
CREATE JUMPING WINDOW <window_name> OVER <source_stream>
KEEP <window_policy>;

// Aggregate the events and the data to be embedded.
CREATE OR REPLACE CQ <CQ_name>
INSERT INTO <int_stream>
SELECT list(w) as events, list(TO_STRING(<col_name>)) as data FROM <window_name> w;

// Generate embeddings and aggregate them into a list.
CREATE CQ <CQ_name>
INSERT INTO <out_stream>
SELECT makeTupleList(a.events, generateEmbeddingsPerBatch(<namespace>.<object_name>, a.data)) as objs FROM <int_stream> a;

// Unroll the list.
CREATE CQ <CQ_name>
INSERT INTO <final_stream>
SELECT putUserData(cast(origevent.get(0) as com.webaction.proc.events.WAEvent), 'embedding', java.util.Arrays.toString(cast(origevent.get(1) as java.lang.Float[]))) FROM <out_stream> e, ITERATOR (e.objs, java.util.List) as origevent;
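
The batch-then-unroll pattern in the TQL above can be sketched in ordinary code as well. In this Python sketch all names are ours, and embed_fn stands in for the provider's batch embeddings call:

```python
def generate_embeddings_per_batch(texts, embed_fn):
    # Send the whole batch to the (stand-in) embeddings call at once,
    # then pair each input with its embedding, like makeTupleList above.
    embeddings = embed_fn(texts)
    return list(zip(texts, embeddings))

def unroll(batches):
    # Flatten the nested per-batch lists into a single one-dimensional
    # list of (text, embedding) pairs, as the final unroll CQ does with ITERATOR.
    return [pair for batch in batches for pair in batch]

# Example with a fake embedder that returns a 1-dimensional vector per text:
batch = generate_embeddings_per_batch(["cat", "dog"], lambda ts: [[0.1], [0.2]])
flat = unroll([batch])
```

Batching amortizes the per-request overhead of the embeddings API across many events, which is where the performance gain comes from.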