TextSynth uses custom inference code to achieve fast inference on both GPUs and CPUs. It has the following characteristics:
A CPU-only version is freely available.
Performance using the GPT-NeoX 20B model on an RTX A6000 Nvidia GPU. For the speed measurement, 200 tokens are generated with a batch size of 1:
Precision | LAMBADA (ppl) | LAMBADA (acc) | Max GPU memory (GB) | Speed (tokens/s) |
---|---|---|---|---|
float16 | 3.66 | 72.6% | 40.7 | 15 |
8 bits | 3.66 | 72.6% | 21.7 | 27 |
4 bits | 3.71 | 72.0% | 11.6 | 41 |
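The memory figures in the table track a back-of-the-envelope estimate of weight storage alone; a minimal sketch (illustrative arithmetic only; the small gap versus the measured numbers comes from activations, KV cache, and runtime overhead):

```python
def weight_memory_gb(n_params: float, bits_per_weight: int) -> float:
    """Memory needed to hold n_params weights at the given precision, in GB."""
    return n_params * bits_per_weight / 8 / 1e9

N_PARAMS = 20e9  # GPT-NeoX 20B

for bits in (16, 8, 4):
    print(f"{bits:>2} bits: ~{weight_memory_gb(N_PARAMS, bits):.0f} GB")
```

This predicts roughly 40, 20, and 10 GB for float16, 8-bit, and 4-bit weights, close to the measured 40.7, 21.7, and 11.6 GB.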
Performance using the Stable Diffusion 1.4 model on an RTX A6000 Nvidia GPU. For the speed measurement, a single image is generated using 50 timesteps and a batch size of 1:
Precision | Max GPU memory (GB) | Generation time (s) |
---|---|---|
float16 | 2.8 | 1.90 |
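Speed measurements like those above come down to timing a single generation call; a minimal sketch of how the tokens/s numbers can be reproduced (the `generate` callable is a hypothetical stand-in for whatever client call drives the server; it is not part of TextSynth):

```python
import time

def tokens_per_second(generate, n_tokens: int) -> float:
    """Time one batch-size-1 generation of n_tokens and return the throughput."""
    start = time.perf_counter()
    generate(n_tokens)  # stand-in for the actual model/server call
    elapsed = time.perf_counter() - start
    return n_tokens / elapsed

# Example with a dummy generator that just sleeps 1 ms per token:
rate = tokens_per_second(lambda n: time.sleep(0.001 * n), 200)
```

The image-generation time in the second table is measured the same way, timing one 50-timestep sampling call instead of token generation.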