<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:media="http://search.yahoo.com/mrss/"><channel><title><![CDATA[Janky AI]]></title><description><![CDATA[Powered by duct tape and dreams.]]></description><link>https://jankyai.droidgram.com/</link><image><url>https://jankyai.droidgram.com/favicon.png</url><title>Janky AI</title><link>https://jankyai.droidgram.com/</link></image><generator>Ghost 5.86</generator><lastBuildDate>Tue, 05 May 2026 10:12:12 GMT</lastBuildDate><atom:link href="https://jankyai.droidgram.com/rss/" rel="self" type="application/rss+xml"/><ttl>60</ttl><item><title><![CDATA[Notable LLMs that are Apache/MIT licensed]]></title><description><![CDATA[<p>Sometimes you want LLMs that are unencumbered by non-commercial licenses. Below is a list of some notable LLMs that have friendly license agreements.</p><ul><li>Mistral family<ul><li>Mistral 7B, Mixtral 8x7B, Mixtral 8x22B</li><li>Mistral <a href="https://mistral.ai/news/mistral-nemo/?ref=jankyai.droidgram.com" rel="noreferrer">Nemo</a> 12B with quantization-aware training for good FP8 performance</li></ul></li><li>Qwen2-0.5B, Qwen2-1.5B, Qwen2-7B, Qwen2-57B-A14B</li></ul>]]></description><link>https://jankyai.droidgram.com/notable-llms-that-are-apache-mit-licensed/</link><guid isPermaLink="false">6699542274c0a200017641c5</guid><dc:creator><![CDATA[DeltaSqueezer]]></dc:creator><pubDate>Thu, 18 Jul 2024 17:52:35 GMT</pubDate><media:content url="https://jankyai.droidgram.com/content/images/2024/07/Default_an_abstract_image_that_evokes_open_source_large_langua_3.jpg" medium="image"/><content:encoded><![CDATA[<img src="https://jankyai.droidgram.com/content/images/2024/07/Default_an_abstract_image_that_evokes_open_source_large_langua_3.jpg" alt="Notable LLMs that are Apache/MIT licensed"><p>Sometimes you want LLMs that are unencumbered by non-commercial 
licenses. Below is a list of some notable LLMs that have friendly license agreements.</p><ul><li>Mistral family<ul><li>Mistral 7B, Mixtral 8x7B, Mixtral 8x22B</li><li>Mistral <a href="https://mistral.ai/news/mistral-nemo/?ref=jankyai.droidgram.com" rel="noreferrer">Nemo</a> 12B with quantization-aware training for good FP8 performance</li></ul></li><li>Qwen2-0.5B, Qwen2-1.5B, Qwen2-7B, Qwen2-57B-A14B (MoE)</li><li>Phi family<ul><li>Phi-1, Phi-1.5</li><li>Phi-2 2.7B</li><li>Phi-3 3.8B, 7B, 14B</li></ul></li><li>Yi family<ul><li>Yi 34B, 9B, and 6B</li><li>Yi 1.5 34B, 9B, and 6B</li></ul></li><li>Falcon 7B, Falcon 40B, LLM360/K2, OLMo-7B</li><li>Neo_7B</li><li>IBM Granite models</li><li>XVERSE 7B/13B/65B</li><li>Snowflake Arctic</li><li>Grok</li><li>DeepSeek-Coder-V2</li><li>Danube 2, Danube 3 (for small models)</li></ul>]]></content:encoded></item><item><title><![CDATA[Reducing idle power consumption for Nvidia P100 and P40 GPUs]]></title><description><![CDATA[<p>One overlooked aspect of GPU usage is the power they consume when idle. Idle power draw refers to the amount of electricity a GPU consumes when it&apos;s not performing intensive tasks. 
This can significantly impact both energy consumption and electricity costs over time.</p><p>Without any tricks, a P40</p>]]></description><link>https://jankyai.droidgram.com/reducing-idle-power-consumption-for-nvidia-p100-and-p40-gpus/</link><guid isPermaLink="false">6693959274c0a200017640b9</guid><dc:creator><![CDATA[DeltaSqueezer]]></dc:creator><pubDate>Sun, 14 Jul 2024 20:52:35 GMT</pubDate><media:content url="https://jankyai.droidgram.com/content/images/2024/07/Default_gpu_pcb_surrounded_by_flames_with_a_black_background_2.jpg" medium="image"/><content:encoded><![CDATA[<img src="https://jankyai.droidgram.com/content/images/2024/07/Default_gpu_pcb_surrounded_by_flames_with_a_black_background_2.jpg" alt="Reducing idle power consumption for Nvidia P100 and P40 GPUs"><p>One overlooked aspect of GPU usage is the power they consume when idle. Idle power draw refers to the amount of electricity a GPU consumes when it&apos;s not performing intensive tasks. This can significantly impact both energy consumption and electricity costs over time.</p><p>Without any tricks, a P40 with VRAM loaded can burn 45W at idle. With some tweaks, this idle power can be reduced to around 10W.</p><h2 id="idle-power-draw-10w-vs-45w">Idle Power Draw: 10W vs 45W</h2><p>Let&apos;s consider the impact of the difference between a 45W and 10W  idle draw. 
While the difference might seem small at first glance, the cumulative effect over a year can be substantial.</p><h3 id="annual-energy-consumption">Annual Energy Consumption</h3><p>To calculate the annual energy consumption, we use the formula: Energy (kWh)&#xA0;= Power (W) &#xD7; Time (hours)&#xA0;/ 1000</p><p>Assuming the GPUs are idle 24 hours a day for 365 days a year, we get:</p><ul><li><strong>10W GPU:</strong> 10W &#xD7; 24 &#xD7; 365 / 1000 = 88 kWh</li><li><strong>45W GPU:</strong> 45W &#xD7; 24 &#xD7; 365 / 1000 = 394 kWh</li></ul><h3 id="annual-cost-of-electricity">Annual Cost of Electricity</h3><p>The cost of electricity can vary substantially from place to place, but where I live it is approximately $0.25 per kWh, which gives the annual costs as follows:</p><table>
<thead>
<tr>
<th><strong>GPU Idle Power Draw (W)</strong></th>
<th style="text-align:right"><strong>Annual Energy Consumption (kWh)</strong></th>
<th style="text-align:right"><strong>Annual Cost ($)</strong></th>
</tr>
</thead>
<tbody>
<tr>
<td>10</td>
<td style="text-align:right">88</td>
<td style="text-align:right">$22.00</td>
</tr>
<tr>
<td>45</td>
<td style="text-align:right">394</td>
<td style="text-align:right">$99.00</td>
</tr>
</tbody>
</table>
<p><em>Table 1: Annual cost comparison of P40 idling at 10W vs 45W</em></p>
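<p>The figures in Table 1 follow directly from the formula above. Here is a quick sketch of the calculation (the $0.25/kWh rate is my local price; adjust for yours):</p><figure class="kg-card kg-code-card"><pre><code class="language-python"># Annual energy and cost of idle draw: Energy (kWh) = Power (W) x hours / 1000
ELECTRICITY_PRICE = 0.25  # $/kWh, local rate used in this article

def annual_idle_cost(idle_watts):
    # Idle 24 hours a day, 365 days a year, converted from Wh to kWh
    energy_kwh = idle_watts * 24 * 365 / 1000
    return energy_kwh, energy_kwh * ELECTRICITY_PRICE

for watts in (10, 45):
    kwh, dollars = annual_idle_cost(watts)
    print(f"{watts}W idle: {kwh:.0f} kWh/year, ${dollars:.2f}/year")</code></pre><figcaption><p><span style="white-space: pre-wrap;">Recomputing Table 1 (the table rounds to whole dollars)</span></p></figcaption></figure>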
<p>The difference in idle power draw between 10W and 45W might seem minor on a per-second basis, but over the span of a year, it results in significant energy consumption and cost differences, especially when you put multiple GPUs in a system.</p><h2 id="p40-idle-state-quirks">P40 idle state quirks</h2><p>The P40 has only P0 and P8 states and idle draw can be as low as 10W when VRAM is empty, but the P40 seems to have a quirk when content is loaded into VRAM: the power draw can be 45W even when the GPU is performing no work.</p><p>Luckily, there are ways to work around this and reduce idle power draw by directly adjusting pstates.</p><h3 id="reducing-idle-power-draw-by-directly-adjusting-pstates">Reducing idle power draw by directly adjusting pstates</h3><p>A library and CLI utilities to manage pstates are available <a href="https://github.com/sasha0552/nvidia-pstate?ref=jankyai.droidgram.com" rel="noreferrer">here</a>: </p><figure class="kg-card kg-bookmark-card"><a class="kg-bookmark-container" href="https://github.com/sasha0552/nvidia-pstate?ref=jankyai.droidgram.com"><div class="kg-bookmark-content"><div class="kg-bookmark-title">GitHub - sasha0552/nvidia-pstate: A library and CLI utilities for managing performance states of NVIDIA GPUs.</div><div class="kg-bookmark-description">A library and CLI utilities for managing performance states of NVIDIA GPUs. 
- sasha0552/nvidia-pstate</div><div class="kg-bookmark-metadata"><img class="kg-bookmark-icon" src="https://github.githubassets.com/assets/pinned-octocat-093da3e6fa40.svg" alt="Reducing idle power consumption for Nvidia P100 and P40 GPUs"><span class="kg-bookmark-author">GitHub</span><span class="kg-bookmark-publisher">sasha0552</span></div></div><div class="kg-bookmark-thumbnail"><img src="https://opengraph.githubassets.com/ca68088ee824108b5e7f18ebb90168e6d85fd0d5db4ef3f53c6627b3d833f220/sasha0552/nvidia-pstate" alt="Reducing idle power consumption for Nvidia P100 and P40 GPUs"></div></a></figure><p>and <a href="https://github.com/sasha0552/nvidia-pstated?ref=jankyai.droidgram.com" rel="noreferrer">daemon</a>:</p><figure class="kg-card kg-bookmark-card"><a class="kg-bookmark-container" href="https://github.com/sasha0552/nvidia-pstated?ref=jankyai.droidgram.com"><div class="kg-bookmark-content"><div class="kg-bookmark-title">GitHub - sasha0552/nvidia-pstated: A daemon that automatically manages the performance states of NVIDIA GPUs.</div><div class="kg-bookmark-description">A daemon that automatically manages the performance states of NVIDIA GPUs. 
- sasha0552/nvidia-pstated</div><div class="kg-bookmark-metadata"><img class="kg-bookmark-icon" src="https://github.githubassets.com/assets/pinned-octocat-093da3e6fa40.svg" alt="Reducing idle power consumption for Nvidia P100 and P40 GPUs"><span class="kg-bookmark-author">GitHub</span><span class="kg-bookmark-publisher">sasha0552</span></div></div><div class="kg-bookmark-thumbnail"><img src="https://opengraph.githubassets.com/784f3ed33d9c9df94c2de6779a3c966d33c5d7a19aa66be250a6ed1b0dc9848e/sasha0552/nvidia-pstated" alt="Reducing idle power consumption for Nvidia P100 and P40 GPUs"></div></a></figure><p>Patches to automatically drop pstates while idle for llama.cpp and vLLM are available <a href="https://github.com/sasha0552/ToriLinux/tree/main/airootfs/home/tori/.local/share/tori/patches?ref=jankyai.droidgram.com" rel="noreferrer">here</a>:</p><figure class="kg-card kg-bookmark-card"><a class="kg-bookmark-container" href="https://github.com/sasha0552/ToriLinux/tree/main/airootfs/home/tori/.local/share/tori/patches?ref=jankyai.droidgram.com"><div class="kg-bookmark-content"><div class="kg-bookmark-title">ToriLinux/airootfs/home/tori/.local/share/tori/patches at main &#xB7; sasha0552/ToriLinux</div><div class="kg-bookmark-description">Linux LiveCD for offline AI training and inference. 
- sasha0552/ToriLinux</div><div class="kg-bookmark-metadata"><img class="kg-bookmark-icon" src="https://github.githubassets.com/assets/pinned-octocat-093da3e6fa40.svg" alt="Reducing idle power consumption for Nvidia P100 and P40 GPUs"><span class="kg-bookmark-author">GitHub</span><span class="kg-bookmark-publisher">sasha0552</span></div></div><div class="kg-bookmark-thumbnail"><img src="https://opengraph.githubassets.com/dd18b3391e6e5a38bc9fd589fdf78037ff1809c23833de133e6d67b1015df90c/sasha0552/ToriLinux" alt="Reducing idle power consumption for Nvidia P100 and P40 GPUs"></div></a></figure><p>There&apos;s also a separate project called <a href="https://github.com/crashr/gppm?ref=jankyai.droidgram.com" rel="noreferrer">gppm</a> that aims to do something similar, handling multiple cards and llama.cpp instances independently.</p><h2 id="p100-has-no-pstates">P100 has no pstates</h2><p>The P100 is a datacenter GPU that was originally designed for training workloads. Since the target workload aimed at continuous maximum utilization, these GPUs have no low-power pstates.</p><p>Even at idle with no data loaded into VRAM, these can consume just under 30W of idle power. Put four of them in a server and you have 120W of idle power just for the GPUs.</p><table>
<thead>
<tr>
<th><strong>GPU Idle Power Draw (W)</strong></th>
<th style="text-align:right"><strong>Annual Energy Consumption (kWh)</strong></th>
<th style="text-align:right"><strong>Annual Cost ($)</strong></th>
</tr>
</thead>
<tbody>
<tr>
<td>120</td>
<td style="text-align:right">1,051</td>
<td style="text-align:right">$263.00</td>
</tr>
</tbody>
</table>
<p><em>Table 2: Annual cost of running 4xP100s at idle power</em></p>
<p>Given this power profile, you would choose P100s if:</p><ul><li>You expect to have high utilization with little idle time</li><li>You want to run computations in batches and will turn off the server when batches are done</li><li>You want the server to double as a space heater or have money to burn</li></ul><p>Since the P100 is not very popular for home use due to this idle power issue and having only 16GB of VRAM compared to the P40&apos;s 24GB, the prices of P100s on the second-hand market have remained relatively low even as P40 prices have skyrocketed.</p><h3 id="but-what-if">But what if...</h3><p>One last power-saving possibility is to mount the GPU on a riser that allows its power to be disconnected, then perform PCIe hot-unplugging. This could theoretically save power at the expense of start-up latency.</p><p>Getting PCIe hot-plugging to work on consumer-grade hardware may be challenging and frustrating (massive understatement alert).</p><h2 id="what-about-operating-power">What about operating power?</h2><p>Idle power is only one aspect; see this article on how to manage active power to maximize efficiency:</p><figure class="kg-card kg-bookmark-card"><a class="kg-bookmark-container" href="https://jankyai.droidgram.com/power-limiting-rtx-3090-gpu-to-increase-power-efficiency/"><div class="kg-bookmark-content"><div class="kg-bookmark-title">Power limiting RTX 3090 GPU to increase power efficiency</div><div class="kg-bookmark-description">I plotted this chart and thought I&#x2019;d share it in case it was useful to others. It is the tok/s output at different power limits with a RTX 3090 during single-inferencing. 
While maximum efficiency is achieved around 211W, this reduces output by around 20%. Running between 260W-280W gives good</div><div class="kg-bookmark-metadata"><img class="kg-bookmark-icon" src="https://jankyai.droidgram.com/content/images/size/w256h256/format/jpeg/2024/06/favicon.jpg" alt="Reducing idle power consumption for Nvidia P100 and P40 GPUs"><span class="kg-bookmark-author">Janky AI</span><span class="kg-bookmark-publisher">DeltaSqueezer</span></div></div><div class="kg-bookmark-thumbnail"><img src="https://jankyai.droidgram.com/content/images/2024/07/3090-hr.png" alt="Reducing idle power consumption for Nvidia P100 and P40 GPUs"></div></a></figure><h2 id="cooling-gpus">Cooling GPUs</h2><p>One final challenge with re-purposing these datacenter GPUs for home use is that the cards do not have active cooling, instead relying on forced air cooling from the server.</p><p>Cooling these cards involves several factors and is not straightforward - or at least not if you don&apos;t want hairdryer levels of screaming fans in the server. Subscribe using the link below to get our guide on the options for cooling these GPUs while retaining your sanity!</p><h2 id="readers-comments">Readers&apos; comments</h2><blockquote>
<p>Thanks for the inspiration.</p>
<p>I just updated someone else&apos;s repo (PR pending approval) to give .net control of the same API that nvidia_pstate is using because unfortunately the python script didn&apos;t enumerate my Tesla GPUs.</p>
<p>Here&apos;s my fork of the .net wrapper: <a href="https://github.com/maz-net-au/NvAPIWrapper?ref=jankyai.droidgram.com">https://github.com/maz-net-au/NvAPIWrapper</a></p>
<p>You can control it like this: (8 is for P8, use 16 to restore the default, auto-switching mode)</p>
<pre><code class="language-csharp">PhysicalGPUHandle[] handles = GPUApi.EnumTCCPhysicalGPUs();
foreach (PhysicalGPUHandle ph in handles)
{
   GPUApi.SetForcePstate(ph, 8, 2); // the 2 is from nvidia_pstate python script
}
</code></pre>
<p>I&apos;m keeping the units at P8 and watching for GPU utilization, allowing P0 for 2 mins after the last poll detected utilization above 10%. That is, as soon as you start inference, I allow the cards to switch to P0, and if they are unused for a couple of minutes, it forces them back to P8.</p>
<p>My Frankenstein&apos;s monster of a Dell R720XD has 2x Tesla P40s and 2x Tesla T4s in it, and if I leave llama.cpp and ComfyUI both running, just the idle P0 power usage heats up the compute units and runs the chassis fans at 80%. This is all a convoluted fix for the issue of not wanting to piss off my wife with the soothing hum of server fans.</p>
</blockquote>
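<p>The commenter&apos;s switching rule is easy to sketch as a tiny state machine. This is a hypothetical sketch (names are mine): the thresholds mirror the comment, and the chosen pstate would then be applied with something like the SetForcePstate call above:</p><figure class="kg-card kg-code-card"><pre><code class="language-python">import time

# Hysteresis policy from the comment above: force P8 by default, and allow
# P0 for two minutes after the last poll that saw utilization above 10%.
ALLOW_P0_SECONDS = 120
UTIL_THRESHOLD_PCT = 10

class PstateGovernor:
    def __init__(self):
        self.last_busy = float("-inf")  # time of the last busy poll

    def target_pstate(self, utilization_pct, now=None):
        # Returns 0 (allow P0) or 8 (force P8) for the latest poll.
        now = time.monotonic() if now is None else now
        if utilization_pct > UTIL_THRESHOLD_PCT:
            self.last_busy = now
        return 8 if now - self.last_busy > ALLOW_P0_SECONDS else 0</code></pre></figure><p>On each poll you would read GPU utilization (e.g. from nvidia-smi or NVML) and force the returned pstate whenever it changes.</p>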
]]></content:encoded></item><item><title><![CDATA[Sometimes when you don't have 340 GB of VRAM]]></title><description><![CDATA[<p>You just have to resort to running on your computer with 12 sticks of 32GB RAM!</p><figure class="kg-card kg-embed-card"><iframe width="200" height="150" src="https://www.youtube.com/embed/TX0eppc88TU?feature=oembed" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen title="NVIDIA Nemotron-4 340B Q8_0 running on AMD Epyc 9374F - real time generation speed"></iframe></figure>]]></description><link>https://jankyai.droidgram.com/sometimes-when-you-dont-have-340-gb-of-vram/</link><guid isPermaLink="false">6691abd874c0a2000176409f</guid><dc:creator><![CDATA[DeltaSqueezer]]></dc:creator><pubDate>Fri, 12 Jul 2024 22:21:04 GMT</pubDate><media:content url="https://jankyai.droidgram.com/content/images/2024/07/nemotron.jpg" medium="image"/><content:encoded><![CDATA[<img src="https://jankyai.droidgram.com/content/images/2024/07/nemotron.jpg" alt="Sometimes when you don&apos;t have 340 GB of VRAM"><p>You just have to resort to running on your computer with 12 sticks of 32GB RAM!</p><figure class="kg-card kg-embed-card"><iframe width="200" height="150" src="https://www.youtube.com/embed/TX0eppc88TU?feature=oembed" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen title="NVIDIA Nemotron-4 340B Q8_0 running on AMD Epyc 9374F - real time generation speed"></iframe></figure>]]></content:encoded></item><item><title><![CDATA[How many GPUs do you want to cram into your box? Yes.]]></title><description><![CDATA[<p>Custom case, or special server cards to fit 4 GPUs into a case? 
No need, we&apos;ll just squash them in there.</p><p>Congrats to stonedoubt for this Tetris-like feat and great thermal density!</p><figure class="kg-card kg-embed-card"><blockquote class="reddit-embed-bq" style="height:500px">
<a href="https://www.reddit.com/r/LocalLLaMA/comments/1dz81sf/behold_my_dumb_sht/?ref=jankyai.droidgram.com">Behold my dumb sh*t &#x1F602;&#x1F602;&#x1F602;</a><br> by
<a href="https://www.reddit.com/user/stonedoubt/?ref=jankyai.droidgram.com">u/stonedoubt</a> in
<a href="https://www.reddit.com/r/LocalLLaMA/?ref=jankyai.droidgram.com">LocalLLaMA</a>
</blockquote>
<script async src="https://embed.reddit.com/widgets.js" charset="UTF-8"></script></figure>]]></description><link>https://jankyai.droidgram.com/how-many-gpus-do-you-want-to-cram-in-your-box-yes/</link><guid isPermaLink="false">668d76dc74c0a20001763fd2</guid><dc:creator><![CDATA[DeltaSqueezer]]></dc:creator><pubDate>Tue, 09 Jul 2024 17:45:43 GMT</pubDate><media:content url="https://jankyai.droidgram.com/content/images/2024/07/behold-my-dumb-sh-t-v0-e74or9th1jbd1-1.webp" medium="image"/><content:encoded><![CDATA[<img src="https://jankyai.droidgram.com/content/images/2024/07/behold-my-dumb-sh-t-v0-e74or9th1jbd1-1.webp" alt="How many GPUs do you want to cram into your box? Yes."><p>Custom case, or special server cards to fit 4 GPUs into a case? No need, we&apos;ll just squash them in there.</p><p>Congrats to stonedoubt for this Tetris-like feat and great thermal density!</p><figure class="kg-card kg-embed-card"><blockquote class="reddit-embed-bq" style="height:500px">
<a href="https://www.reddit.com/r/LocalLLaMA/comments/1dz81sf/behold_my_dumb_sht/?ref=jankyai.droidgram.com">Behold my dumb sh*t &#x1F602;&#x1F602;&#x1F602;</a><br> by
<a href="https://www.reddit.com/user/stonedoubt/?ref=jankyai.droidgram.com">u/stonedoubt</a> in
<a href="https://www.reddit.com/r/LocalLLaMA/?ref=jankyai.droidgram.com">LocalLLaMA</a>
</blockquote>
<script async src="https://embed.reddit.com/widgets.js" charset="UTF-8"></script></figure>]]></content:encoded></item><item><title><![CDATA[4xV100 SXM Build]]></title><description><![CDATA[<p>This <a href="https://github.com/l4rz/building-a-poor-mans-supercomputer?ref=jankyai.droidgram.com" rel="noreferrer">build</a> is several years old, so the prices quoted are much higher than today&apos;s. I&apos;ve been very interested in V100 SXM builds as more of these come onto the market and prices fall. I&apos;ve not yet pulled the trigger on such a build as prices</p>]]></description><link>https://jankyai.droidgram.com/4xv100-sxm-build/</link><guid isPermaLink="false">668b9b3774c0a20001763fb0</guid><dc:creator><![CDATA[DeltaSqueezer]]></dc:creator><pubDate>Mon, 08 Jul 2024 08:03:39 GMT</pubDate><media:content url="https://jankyai.droidgram.com/content/images/2024/07/c4130-assy-1.jpg" medium="image"/><content:encoded><![CDATA[<img src="https://jankyai.droidgram.com/content/images/2024/07/c4130-assy-1.jpg" alt="4xV100 SXM Build"><p>This <a href="https://github.com/l4rz/building-a-poor-mans-supercomputer?ref=jankyai.droidgram.com" rel="noreferrer">build</a> is several years old, so the prices quoted are much higher than today&apos;s. I&apos;ve been very interested in V100 SXM builds as more of these come onto the market and prices fall. I&apos;ve not yet pulled the trigger on such a build: prices were still a little too high for my liking, parts are tricky to source, and some of the sources look dubious, but I&apos;ll for sure be keeping my eye on this.</p><figure class="kg-card kg-bookmark-card"><a class="kg-bookmark-container" href="https://github.com/l4rz/building-a-poor-mans-supercomputer?ref=jankyai.droidgram.com"><div class="kg-bookmark-content"><div class="kg-bookmark-title">GitHub - l4rz/building-a-poor-mans-supercomputer: I&#x2019;ve built a 4x V100 box for less than $5,500.</div><div class="kg-bookmark-description">I&#x2019;ve built a 4x V100 box for less than $5,500. 
Contribute to l4rz/building-a-poor-mans-supercomputer development by creating an account on GitHub.</div><div class="kg-bookmark-metadata"><img class="kg-bookmark-icon" src="https://github.githubassets.com/assets/pinned-octocat-093da3e6fa40.svg" alt="4xV100 SXM Build"><span class="kg-bookmark-author">GitHub</span><span class="kg-bookmark-publisher">l4rz</span></div></div><div class="kg-bookmark-thumbnail"><img src="https://opengraph.githubassets.com/988696c41543fc666359bac5bc777fc972733a7323ed8dc78d0f6ee6431c3fe3/l4rz/building-a-poor-mans-supercomputer" alt="4xV100 SXM Build"></div></a></figure><p>Lots of great janky takeaways from this build including the DIY heatsink and discussion on avoiding paying for a $350 precision torque wrench by using your fingers to tighten the screws.</p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://jankyai.droidgram.com/content/images/2024/07/fabricated-heatsink.jpg" class="kg-image" alt="4xV100 SXM Build" loading="lazy" width="2000" height="1335" srcset="https://jankyai.droidgram.com/content/images/size/w600/2024/07/fabricated-heatsink.jpg 600w, https://jankyai.droidgram.com/content/images/size/w1000/2024/07/fabricated-heatsink.jpg 1000w, https://jankyai.droidgram.com/content/images/size/w1600/2024/07/fabricated-heatsink.jpg 1600w, https://jankyai.droidgram.com/content/images/2024/07/fabricated-heatsink.jpg 2000w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Pay $100 for a heatsink? No! We&apos;ll make our own! </span></figcaption></figure>]]></content:encoded></item><item><title><![CDATA[Power limiting RTX 3090 GPU to increase power efficiency]]></title><description><![CDATA[<p>I plotted this chart and thought I&apos;d share it in case it was useful to others. It is the tok/s output at different power limits with a RTX 3090 during single-inferencing. 
While maximum efficiency is achieved around 211W, this reduces output by around 20%</p><p>Running between 260W-280W</p>]]></description><link>https://jankyai.droidgram.com/power-limiting-rtx-3090-gpu-to-increase-power-efficiency/</link><guid isPermaLink="false">667d2cb54b453c0001455662</guid><dc:creator><![CDATA[DeltaSqueezer]]></dc:creator><pubDate>Thu, 27 Jun 2024 09:15:49 GMT</pubDate><media:content url="https://jankyai.droidgram.com/content/images/2024/07/3090-hr.png" medium="image"/><content:encoded><![CDATA[<img src="https://jankyai.droidgram.com/content/images/2024/07/3090-hr.png" alt="Power limiting RTX 3090 GPU to increase power efficiency"><p>I plotted this chart and thought I&apos;d share it in case it was useful to others. It is the tok/s output at different power limits with an RTX 3090 during single-inferencing. While maximum efficiency is achieved around 211W, this reduces output by around 20%.</p><p>Running between 260W-280W gives good energy savings while maintaining nearly maximum output.</p><p>While this gives a good rule of thumb, the actual numbers will vary with the model used, and particularly when batch inferencing instead of single-inferencing.</p><h3 id="why-power-limit-a-gpu">Why power limit a GPU?</h3><p>Why would you voluntarily leave performance on the table when you paid a lot of money for a GPU? There are several reasons:</p><ul><li>The most important reason is that you don&apos;t need to leave a lot of performance on the table: the default power limits on consumer GPUs try to squeeze the last drops of performance out of the GPU even at the expense of much higher power consumption. <br><br>By dropping performance by low single-digit percentage points, you can save double-digit percentage points of power.</li><li>Reducing peak and sustained power consumption means that you will not need as powerful and expensive a PSU to power the GPUs. 
<br><br>In some cases where multiple PSUs would otherwise be required, this can eliminate additional PSUs or allow you to use cheaper, lower-rated PSUs, which saves on costs and reduces complexity.</li></ul><h3 id="code-and-data">Code and Data</h3><p>I had a request to share the data for the chart, so the data and chart-plotting code are below:</p><figure class="kg-card kg-code-card"><pre><code class="language-python">import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit

# 3090 Power data
growth_data = [(100, 18), (125, 20), (150, 42), (175, 64), (187, 76), (200, 83), (225, 91), (250, 97), (265, 98), (275, 98), (280, 99), (285, 101), (300, 101), (325, 102), (350, 103), (375, 104)]

# Convert data to numpy arrays
x = np.array([t[0] for t in growth_data])
y = np.array([t[1] for t in growth_data])
yo = np.array([t[1]/t[0] for t in growth_data])

# Define the Gompertz function
def gompertz(x, a, b, c):
    return a * np.exp(-b * np.exp(-c * (x)))

def gompertzx(x, a, b, c):
    return a * np.exp(-b * np.exp(-c * (x))) / x

# Initial guess for parameters
p0 = [100, 0.1, 0.01]

# Fit the curve
popt, pcov = curve_fit(gompertz, x, y, p0)
#popt2, pcov2 = curve_fit(gompertzx, x, y, p0)

# Print the parameters of the fitting curve
print(&quot;Fitting parameters:&quot;)
print(&quot;a =&quot;, popt[0])
print(&quot;b =&quot;, popt[1])
print(&quot;c =&quot;, popt[2])

# Calculate the maximum value of y
x_fit = np.linspace(x.min(), x.max(), 10000)
y_fit = gompertz(x_fit, *popt)
yo_fit = gompertzx(x_fit, *popt)
max_yo = np.max(yo_fit)
max_xo = x_fit[np.argmax(yo_fit)]

# Plot the data and the fitted curve
fig, ax1 = plt.subplots()

ax1.plot(x, y, &apos;ko&apos;)
ax1.plot(x_fit, y_fit, &apos;r-&apos;, label=&apos;Gompertz fit&apos;)
ax1.set_xlabel(&apos;Power (Watts)&apos;)
ax1.set_ylabel(&apos;Output (tok/s)&apos;, color=&apos;r&apos;)
ax1.tick_params(&apos;y&apos;, colors=&apos;r&apos;)

ax2 = ax1.twinx()
ax2.plot(x_fit, yo_fit, &apos;b-&apos;, label=&apos;Efficiency&apos;)
ax2.set_ylabel(&apos;tok/s/W&apos;, color=&apos;b&apos;)
ax2.tick_params(&apos;y&apos;, colors=&apos;b&apos;)

ax2.plot(x, gompertzx(x,*popt), &apos;bo&apos;)

# Indicate the maximum value of y
ax2.plot([max_xo, max_xo], [0, max_yo], &apos;k--&apos;, label=&apos;Max Eff.&apos;)
ax2.annotate(f&apos;Max efficiency: {max_xo:.0f}W&apos;, xy=(max_xo, max_yo), xytext=(max_xo+5, 0.25))

fig.tight_layout()
plt.title(&apos;RTX3090 output vs power&apos;)
fig.legend(loc=(0.6,0.2))
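
# Sanity check of the headline claim, using only the fitted curve:
# compare output at the max-efficiency power against output near the
# 375W stock limit (expect a drop of roughly 20%).
reduction = 1 - gompertz(max_xo, *popt) / gompertz(375, *popt)
print(f"Capping at {max_xo:.0f}W costs about {100*reduction:.0f}% of peak output")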

plt.show()</code></pre><figcaption><p><span style="white-space: pre-wrap;">Code generated with the help of LLMs!</span></p></figcaption></figure><figure class="kg-card kg-code-card"><pre><code>Fitting parameters:

a = 104.54941090829679
b = 23.474054254669152
c = 0.022347470077472967</code></pre><figcaption><p><span style="white-space: pre-wrap;">Fitted coefficients</span></p></figcaption></figure><h3 id="idle-power">Idle power</h3><p>Limiting peak and sustained power is just one side of the equation: it can increase efficiency, reduce the initial purchase cost, and make for a simpler, more compact AI server by reducing the number of PSUs required.</p><p>However, there are two other things to consider:</p><ul><li>Controlling idle power consumption; and</li><li>How to power multiple high-performance GPUs in a single server in an efficient way.</li></ul><p>If you&apos;d like to see these articles, subscribe and get alerted when these follow-up articles become available.</p>]]></content:encoded></item></channel></rss>