GPTScript Knowledge Tool v0.3 Introduces One-Time Configuration for Embedding Model Providers

Jul 31, 2024 by Thorsten Klein

OK, the title may be a little exaggerated, but still: our little Knowledge tool gained the ability to use various Embedding Models, or rather Embedding Model Providers, in v0.3.0. GPTScript is our natural-language approach to programming: you prompt in English (or your native language) to write tools, which are then strung together via AI and executed (see the initial blog post for a refresher).

In practice, this means you can now define any number of embedding model providers in a newly created config file, and for every knowledge base (dataset) you create, you can choose any of them.

An important note here: you can only use one embedding model per dataset. You cannot ingest one file with model A and another with model B, as that would screw up the vector space due to different embedding types and vector dimensionalities.

As of v0.3.0, we have tested the knowledge tool with the following providers and models (with no judgement on how well each of them works in terms of accuracy and performance):

  • OpenAI (default) with text-embedding-ada-002
  • Azure OpenAI with text-embedding-ada-002
  • LM-Studio (local) with CompendiumLabs/bge-large-en-v1.5-gguf
    • Note: This is not a built-in provider, but it can still be used by setting the corresponding OpenAI variables (see the sketch after this list)
    • Warning: At the time of testing, LM-Studio didn’t support parallel calls to the embeddings endpoint, so it was pretty slow and we had to set the parallelism to 1 via VS_CHROMEM_EMBEDDING_PARALLEL_THREAD=1
  • Cohere with embed-english-v3.0
  • Jina with jina-embeddings-v2-base-en
  • LocalAI with bert-cpp-minilm-v6
  • Mistral with mistral-embed
  • Mixedbread with all-MiniLM-L6-v2
  • Ollama with mxbai-embed-large
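
For the LM-Studio case mentioned above, a minimal sketch might look like the following. Note that the variable names OPENAI_BASE_URL and OPENAI_API_KEY are assumptions based on common OpenAI-compatible setups, and the dataset name is a placeholder; check the knowledge documentation for the exact names the tool reads.

# Point the OpenAI-compatible client at LM-Studio's local server
# (http://localhost:1234/v1 is LM-Studio's default; adjust if you changed it)
export OPENAI_BASE_URL="http://localhost:1234/v1"
export OPENAI_API_KEY="lm-studio"  # LM-Studio doesn't validate the key, but one must be set

# Work around the missing parallel-call support mentioned above
export VS_CHROMEM_EMBEDDING_PARALLEL_THREAD=1

knowledge ingest -d my-lmstudio-dataset path/to/files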

Local Models

I tested the locally running models, especially via LM-Studio and Ollama, on my development laptop (an i7-1260P with 64GB RAM). With Ollama, ingesting a 509-page PDF file took about 5 minutes. That was without enabling parallelism on Ollama, and I bet there are some settings I could tweak, as my laptop was using pretty few resources.
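
If you want to reproduce something similar with Ollama, a sketch could look like this (assuming a local Ollama install; the dataset name and file path are placeholders):

# Fetch the embedding model listed above (Ollama's standard CLI)
ollama pull mxbai-embed-large

# Ingest using the built-in ollama provider
knowledge ingest -d my-local-dataset --embedding-model-provider=ollama path/to/big-document.pdf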

Using a Model Provider other than OpenAI

With the command-line flag --embedding-model-provider or the related environment variable KNOW_EMBEDDING_MODEL_PROVIDER, you can switch between providers, even ones you didn’t define in a config file. By default, this is set to openai (but you could modify the OpenAI environment variables to point it at any OpenAI-API-compatible endpoint as well).

Now let’s say you also have access to Google Vertex AI, have all the environment variables configured (at least VERTEX_API_KEY and VERTEX_PROJECT_ID would be required in this case), and want to ingest into a newly created dataset using the Vertex provider. You would do that as shown below:

export VERTEX_API_KEY="my-super-secret-key"
export VERTEX_PROJECT_ID="my-google-project"
knowledge ingest -d my-vertex-powered-dataset --embedding-model-provider=vertex path/to/some/files

# or alternatively
export KNOW_EMBEDDING_MODEL_PROVIDER="vertex"
knowledge ingest -d my-vertex-powered-dataset path/to/more/files

Configuring multiple Model Providers

The drawback of environment variables is that you can define them only once per provider. You could work around this with dotenv files to maintain multiple configurations, e.g. two different variations for Vertex using different projects or models, or different Ollama servers.
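
As a sketch, such a dotenv-based workaround might look like this (the file names and values are made up; set -a exports everything the sourced file defines):

# vertex-project-a.env
VERTEX_API_KEY="key-for-project-a"
VERTEX_PROJECT_ID="project-a"

# vertex-project-b.env
VERTEX_API_KEY="key-for-project-b"
VERTEX_PROJECT_ID="project-b"

# Load one of them before ingesting (POSIX shell)
set -a; . ./vertex-project-a.env; set +a
knowledge ingest -d project-a-dataset --embedding-model-provider=vertex path/to/files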

This can get quite cumbersome… but don’t worry: here comes the YAML config file (you’ll love it, but you may use JSON as well), where you can define as many providers as you want and give them different names.

Here you can see an example config that defines three different provider configs, two of which use the same provider type, which wouldn’t be possible with environment variables alone:

embeddings:
  providers:
    - name: my-vertex # custom name
      type: vertex # one of the provider types as shown in the list further up
      config:
        apiKey: ${SOME_VERTEX_API_KEY} # environment variables will get expanded
        model: text-embedding-004
    - name: ollama-1
      type: ollama
      config:
        model: mxbai-embed-large
    - name: ollama-2
      type: ollama
      config:
        model: nomic-embed-text

With this config file, we can now reference any provider by its custom name:

knowledge ingest -c /path/to/config.yaml --embedding-model-provider="ollama-1" -d my-ollama-1-dataset path/to/files
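
And since a dataset is tied to the embedding model it was ingested with (see the note at the top), queries against it should go through the same config. Assuming the retrieve subcommand accepts the same -c and -d flags (the query string here is just an example):

knowledge retrieve -c /path/to/config.yaml -d my-ollama-1-dataset "What is the Knowledge tool?"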

More about the Config File and Embedding Model Providers

You can find more up-to-date information on this new config file and the embedding model providers we integrated in the knowledge documentation.

Outlook

With this new setup, it will soon be possible to share knowledge bases/datasets with the embedding model provider information attached to them. That finally removes a hurdle: before ingesting into an imported dataset, you may have needed some time to figure out exactly which model was originally used. With the originally used provider and model attached to the dataset, the knowledge tool just has to get your API token (from the environment, or by asking you for it) and you’re set up, without any further configuration.

This will make sharing datasets and collaboratively building knowledge bases even easier!