llama.cpp

The goal of llama.cpp is to run the LLaMA model on a MacBook using a plain C/C++ implementation.

Building the tools

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make
Four executables are created; only main is documented in these notes.
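As a quick check that the build worked, printing the help text of main should show the options listed in the next section (this assumes make places the main binary in the repository root):

./main -h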

main
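A minimal sketch of a plain, non-interactive run, using the default model path from the -m option in the table below. The prompt text, token count, and thread count are only illustrative, and the model file must already have been converted and placed at that path:

./main -m models/llama-7B/ggml-model.bin -p "The first man on the moon was" -n 128 -t 8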

Options

| Parameter | Description |
| --- | --- |
| -h, --help | Show this help message and exit |
| -i, --interactive | Run in interactive mode |
| --interactive-first | Run in interactive mode and wait for input right away |
| -ins, --instruct | Run in instruction mode (use with Alpaca models) |
| -r, --reverse-prompt PROMPT | Run in interactive mode and poll user input upon seeing PROMPT (can be specified more than once for multiple prompts) |
| --color | Colorise output to distinguish prompt and user input from generations |
| -s, --seed SEED | Seed for the random number generator (default: -1; values <= 0 use a random seed) |
| -t, --threads N | Number of threads to use during computation (default: 12) |
| -p, --prompt PROMPT | Prompt to start generation with (default: empty) |
| --random-prompt | Start with a randomized prompt |
| --in-prefix STRING | String to prefix user inputs with (default: empty) |
| -f, --file FNAME | Prompt file to start generation from |
| -n, --n_predict N | Number of tokens to predict (default: 128, -1 = infinity) |
| --top_k N | Top-k sampling (default: 40) |
| --top_p N | Top-p sampling (default: 0.9) |
| --repeat_last_n N | Last N tokens to consider for the repeat penalty (default: 64) |
| --repeat_penalty N | Penalty applied to repeated sequences of tokens (default: 1.1) |
| -c, --ctx_size N | Size of the prompt context (default: 512) |
| --ignore-eos | Ignore the end-of-stream token and continue generating |
| --memory_f32 | Use f32 instead of f16 for memory key+value |
| --temp N | Temperature (default: 0.8) |
| --n_parts N | Number of model parts (default: -1 = determine from dimensions) |
| -b, --batch_size N | Batch size for prompt processing (default: 8) |
| --perplexity | Compute perplexity over the prompt |
| --keep | Number of tokens to keep from the initial prompt (default: 0, -1 = all) |
| --mlock | Force the system to keep the model in RAM rather than swapping or compressing it |
| --mtest | Determine the maximum memory usage needed for inference with the given n_batch and n_predict parameters (uncomment the "used_mem" line in llama.cpp to see the results) |
| --verbose-prompt | Print the prompt before generation |
| -m, --model FNAME | Model path (default: models/llama-7B/ggml-model.bin) |
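The interactive options can be combined. The sketch below starts a chat-style session with the same assumed model path: -i keeps the session interactive, --color separates user input from generations, and -r "User:" hands control back to the user whenever the model emits that string. The numeric values are illustrative (most of them simply restate the defaults from the table):

./main -m models/llama-7B/ggml-model.bin -i --color -r "User:" -c 512 --temp 0.8 --repeat_penalty 1.1 -n 256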

Links

ggml is a tensor library, written in C, that is used by llama.cpp. In fact, the description of ggml reads: "Note that this project is under development and not ready for production use. Some of the development is currently happening in the llama.cpp and whisper.cpp repos."
Python bindings for llama.cpp are also available.
