GPT4All#
A free-to-use, locally running, privacy-aware chatbot. No GPU or internet required.
GPT4All API Server#
Activating the API Server#
1. Open the GPT4All Chat Desktop Application.
2. Go to `Settings` > `Application` and scroll down to `Advanced`.
3. Check the box for the "Enable Local API Server" setting.

The server listens on port 4891 by default. You can choose another port number in the "API Server Port" setting.
Connecting to the API Server#
The base URL used for the API server is http://localhost:4891/v1 (or http://localhost:<PORT_NUM>/v1 if you are using a different port number).
Examples#
```shell
curl -X POST http://localhost:4891/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Phi-3 Mini Instruct",
    "messages": [{"role": "user", "content": "Who is Lionel Messi?"}],
    "max_tokens": 50,
    "temperature": 0.28
  }'
```
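The same request can be issued from Python. Here is a minimal sketch using only the standard library; it assumes the server is running on the default port and that responses follow the OpenAI-style chat-completions shape (a `choices` list containing a `message`):

```python
import json
from urllib.request import Request, urlopen

BASE_URL = "http://localhost:4891/v1"  # adjust the port if you changed it in Settings

def chat(prompt, model="Phi-3 Mini Instruct", max_tokens=50, temperature=0.28):
    """POST a chat-completion request to the local GPT4All API server."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    req = Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urlopen(req) as resp:
        # Assumes an OpenAI-compatible response envelope.
        return json.load(resp)["choices"][0]["message"]["content"]
```

Call `chat("Who is Lionel Messi?")` with the desktop app's API server enabled to get the generated reply as a string.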
API Endpoints#
| Method | Path | Description |
|---|---|---|
| GET | /v1/models | List available models |
| GET | /v1/models/<name> | Get details of a specific model |
| POST | /v1/completions | Generate text completions |
| POST | /v1/chat/completions | Generate chat completions |
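The two GET endpoints are useful for discovering what the server can serve before generating. A small sketch, assuming the server returns the OpenAI-style `{"data": [...]}` envelope for model listings:

```python
import json
from urllib.request import urlopen

BASE_URL = "http://localhost:4891/v1"  # adjust the port if you changed it

def list_models():
    """GET /v1/models -- return the list of available model entries."""
    with urlopen(f"{BASE_URL}/models") as resp:
        return json.load(resp)["data"]

def get_model(name):
    """GET /v1/models/<name> -- return details for a single model."""
    with urlopen(f"{BASE_URL}/models/{name}") as resp:
        return json.load(resp)
```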
GPT4All Python SDK#
Installation#
```shell
pip install gpt4all
```
Load LLM#
```python
from gpt4all import GPT4All

model = GPT4All("Meta-Llama-3-8B-Instruct.Q4_0.gguf")  # downloads / loads a 4.66GB LLM
with model.chat_session():
    print(model.generate("How can I run LLMs efficiently on my laptop?", max_tokens=1024))
```
Chat Session Generation#
```python
from gpt4all import GPT4All

model = GPT4All("Meta-Llama-3-8B-Instruct.Q4_0.gguf")
with model.chat_session():
    print(model.generate("quadratic formula"))
```
Direct Generation#
```python
from gpt4all import GPT4All

model = GPT4All("Meta-Llama-3-8B-Instruct.Q4_0.gguf")
print(model.generate("quadratic formula"))
```
Embeddings#
```python
from nomic import embed

embeddings = embed.text(["String 1", "String 2"], inference_mode="local")["embeddings"]
print("Number of embeddings created:", len(embeddings))
print("Number of dimensions per embedding:", len(embeddings[0]))
```
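A common next step is comparing embeddings by cosine similarity. The sketch below uses short toy vectors in place of real embeddings (nomic embeddings have many more dimensions); the similarity function works unchanged on the vectors returned by `embed.text`:

```python
from math import sqrt

# Toy vectors standing in for two embeddings; real ones are much longer.
a = [0.1, 0.3, 0.5]
b = [0.2, 0.1, 0.4]

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors (1.0 = identical direction)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = sqrt(sum(x * x for x in u))
    norm_v = sqrt(sum(x * x for x in v))
    return dot / (norm_u * norm_v)

score = cosine_similarity(a, b)
print("Similarity:", score)
```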
Tips#
GPT4All Python SDK#
The Python binding logs console errors when CUDA is not found, even when CPU inference is requested. The model still loads and runs on the CPU, so these messages can be ignored:

```
Failed to load llamamodel-mainline-cuda-avxonly.dll: LoadLibraryExW failed with error 0x7e
Failed to load llamamodel-mainline-cuda.dll: LoadLibraryExW failed with error 0x7e
```