llama.cpp

The goal of llama.cpp is to run the LLaMA model on a MacBook using a plain C/C++ implementation.

Building the tools

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make
Four executables are created; only main is documented in these notes.
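As a quick check that the build worked, printing the help text of main should show the options listed in the next section (this assumes make places the main binary in the repository root):

./main -h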

main
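A minimal sketch of a plain, non-interactive run, using the default model path from the -m option in the table below. The prompt text, token count, and thread count are only illustrative, and the model file must already have been converted and placed at that path:

./main -m models/llama-7B/ggml-model.bin -p "The first man on the moon was" -n 128 -t 8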

Options

| Parameter | Description |
| --- | --- |
| -h, --help | Show this help message and exit |
| -i, --interactive | Run in interactive mode |
| --interactive-first | Run in interactive mode and wait for input right away |
| -ins, --instruct | Run in instruction mode (use with Alpaca models) |
| -r, --reverse-prompt PROMPT | Run in interactive mode and poll user input upon seeing PROMPT (can be specified more than once for multiple prompts) |
| --color | Colorise output to distinguish prompt and user input from generations |
| -s, --seed SEED | Seed for the random number generator (default: -1; values <= 0 use a random seed) |
| -t, --threads N | Number of threads to use during computation (default: 12) |
| -p, --prompt PROMPT | Prompt to start generation with (default: empty) |
| --random-prompt | Start with a randomized prompt |
| --in-prefix STRING | String to prefix user inputs with (default: empty) |
| -f, --file FNAME | Prompt file to start generation from |
| -n, --n_predict N | Number of tokens to predict (default: 128, -1 = infinity) |
| --top_k N | Top-k sampling (default: 40) |
| --top_p N | Top-p sampling (default: 0.9) |
| --repeat_last_n N | Last N tokens to consider for the repeat penalty (default: 64) |
| --repeat_penalty N | Penalty applied to repeated sequences of tokens (default: 1.1) |
| -c, --ctx_size N | Size of the prompt context (default: 512) |
| --ignore-eos | Ignore the end-of-stream token and continue generating |
| --memory_f32 | Use f32 instead of f16 for memory key+value |
| --temp N | Temperature (default: 0.8) |
| --n_parts N | Number of model parts (default: -1 = determine from dimensions) |
| -b, --batch_size N | Batch size for prompt processing (default: 8) |
| --perplexity | Compute perplexity over the prompt |
| --keep | Number of tokens to keep from the initial prompt (default: 0, -1 = all) |
| --mlock | Force the system to keep the model in RAM rather than swapping or compressing it |
| --mtest | Determine the maximum memory usage needed for inference with the given n_batch and n_predict parameters (uncomment the "used_mem" line in llama.cpp to see the results) |
| --verbose-prompt | Print the prompt before generation |
| -m, --model FNAME | Model path (default: models/llama-7B/ggml-model.bin) |
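The interactive options can be combined. The sketch below starts a chat-style session with the same assumed model path: -i keeps the session interactive, --color separates user input from generations, and -r "User:" hands control back to the user whenever the model emits that string. The numeric values are illustrative (most of them simply restate the defaults from the table):

./main -m models/llama-7B/ggml-model.bin -i --color -r "User:" -c 512 --temp 0.8 --repeat_penalty 1.1 -n 256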

Links

ggml is a tensor library, written in C, that is used by llama.cpp. In fact, the description of ggml reads: "Note that this project is under development and not ready for production use. Some of the development is currently happening in the llama.cpp and whisper.cpp repos."
Python bindings for llama.cpp are also available.
