OK, the title may be a little exaggerated, but still: our little Knowledge tool gained the ability to use various embedding models, or rather embedding model providers, in v0.3.0. GPTScript is our natural-language approach to programming: you prompt in English (or your native language) to write tools, which are then strung together via AI to execute (view the initial blog here for a refresher).
This means you can now define an endless list of embedding model providers in a newly introduced config file, and for every knowledge base (dataset) you create, you can choose any of them.
Important note: you can only use one embedding model per dataset. You cannot ingest one file with model A and another with model B, as that would corrupt the vector space due to differing embedding types and vector dimensionalities.
As of v0.3.0 we tested the knowledge tool with the following providers and models (with no judgement of how well each of them works in terms of accuracy and performance):
- OpenAI (default) with `text-embedding-ada-002`
- Azure OpenAI with `text-embedding-ada-002`
- LM-Studio (local) with `CompendiumLabs/bge-large-en-v1.5-gguf`
  - Note: This is not a built-in provider, but it can still be used by setting the corresponding OpenAI variables.
  - Warning: At the time of testing, LM-Studio didn't support parallel calls to the embeddings endpoint, so it was pretty slow and we had to set parallelism to 1 via `VS_CHROMEM_EMBEDDING_PARALLEL_THREAD=1`.
- Cohere with `embed-english-v3.0`
- Jina with `embed-english-v3.0`
- LocalAI with `bert-cpp-minilm-v6`
- Mistral with `mistral-embed`
- Mixedbread with `all-MiniLM-L6-v2`
- Ollama with `mxbai-embed-large`
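As noted above, LM-Studio is not a built-in provider but is reached through the OpenAI-compatible code path. A minimal sketch of that setup follows; the variable names `OPENAI_BASE_URL` and `OPENAI_API_KEY`, the port, and the placeholder dataset name are assumptions, so check your LM-Studio server settings and the knowledge documentation for the exact names:

```shell
# Point the OpenAI provider at a local LM-Studio server
# (variable names and port are assumptions; adjust to your setup).
export OPENAI_BASE_URL="http://localhost:1234/v1"
export OPENAI_API_KEY="lm-studio"   # LM-Studio ignores the key, but one must be set

# Work around the missing parallel-call support mentioned in the warning above.
export VS_CHROMEM_EMBEDDING_PARALLEL_THREAD=1

# Then ingest as usual (hypothetical dataset name):
# knowledge ingest -d my-local-dataset path/to/files
```

The actual ingest call is commented out here, since it requires a running LM-Studio server.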
Local Models
I tested the locally running models, especially via LM-Studio and Ollama, on my development laptop with an i7-1260P and 64 GB RAM. With Ollama, ingesting a 509-page PDF file took about 5 minutes. That was without enabling parallelism on Ollama, and I bet there are some settings I could tweak, as my laptop was using fairly few resources.
Using a Model Provider other than OpenAI
With the command-line flag `--embedding-model-provider` or the related environment variable `KNOW_EMBEDDING_MODEL_PROVIDER`, you can switch between providers, even if you didn't define them in a config file. By default, this is set to `openai` (but you could modify the OpenAI environment variables to point it to any OpenAI-API-compatible endpoint as well).
Now let's say you also have access to Google Vertex AI, have all the environment variables configured (at least `VERTEX_API_KEY` and `VERTEX_PROJECT_ID` would be required in this case), and want to ingest into a newly created dataset using the Vertex provider. You would do it like this:
```shell
export VERTEX_API_KEY="my-super-secret-key"
export VERTEX_PROJECT_ID="my-google-project"

knowledge ingest -d my-vertex-powered-dataset --embedding-model-provider=vertex path/to/some/files

# or alternatively
export KNOW_EMBEDDING_MODEL_PROVIDER="vertex"
knowledge ingest -d my-vertex-powered-dataset path/to/more/files
```
Configuring multiple Model Providers
Obviously, you can define the environment variables only once per provider type. You could use dotenv files to manage multiple configurations, e.g. two different variations for Vertex using different projects or models, or different Ollama servers. But that can be quite cumbersome. Don't worry, here comes the YAML config file (you'll love it, but you may use JSON as well), where you can define as many providers as you want and give each of them a custom name.
Here is an example config that defines three different provider configs, two of which use the same provider type, which wouldn't be possible with just environment variables:
```yaml
embeddings:
  providers:
    - name: my-vertex # custom name
      type: vertex    # one of the provider types shown in the list further up
      config:
        apiKey: ${SOME_VERTEX_API_KEY} # environment variables will get expanded
        model: text-embedding-004
    - name: ollama-1
      type: ollama
      config:
        model: mxbai-embed-large
    - name: ollama-2
      type: ollama
      config:
        model: nomic-embed-text
```
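Since JSON is also accepted, the same three-provider config can be expressed as JSON. This assumes the keys map one-to-one between the two formats, so verify the exact schema against the knowledge documentation:

```json
{
  "embeddings": {
    "providers": [
      {
        "name": "my-vertex",
        "type": "vertex",
        "config": {
          "apiKey": "${SOME_VERTEX_API_KEY}",
          "model": "text-embedding-004"
        }
      },
      {
        "name": "ollama-1",
        "type": "ollama",
        "config": { "model": "mxbai-embed-large" }
      },
      {
        "name": "ollama-2",
        "type": "ollama",
        "config": { "model": "nomic-embed-text" }
      }
    ]
  }
}
```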
With this config file, we can now reference any provider by its custom name:

```shell
knowledge ingest -c /path/to/config.yaml --embedding-model-provider="ollama-1" -d my-ollama-1-dataset path/to/files
```
More about the Config File and Embedding Model Providers
You can find more up-to-date information on this new config file and the integrated embedding model providers in the knowledge documentation.
Outlook
With this new setup, it will soon be possible to share knowledge bases (datasets) with the embedding model provider information attached, finally removing the hurdle of having to figure out exactly which model was used before ingesting into an imported dataset.
With the originally used provider and model recorded on the dataset, the knowledge tool just has to get your API token (from the environment, or by asking you for it) and you're set up, without any further configuration.
This will make sharing datasets and collaboratively building knowledge bases even easier!