It is no secret that GPUs are critical for artificial intelligence and deep learning applications, since their highly efficient architectures make them ideal for compute-intensive use cases. However, almost everyone who has used them also knows that they tend to be expensive. In this article, we hope to show that while the per-hour cost of a GPU is higher, it can in fact be cheaper from a total cost-to-solution perspective. Additionally, your time-to-insight is going to be substantially lower, potentially leading to additional savings. In this benchmark, we compare the runtimes and the cost-to-solution for eight high-performance GPU cluster configurations and two CPU-only cluster configurations available on the Databricks platform, for an NLP application.

Why are GPUs beneficial?

GPUs are ideally suited to this task since they have a large number of compute units and an architecture designed for number crunching. For example, the NVIDIA A100 GPU has been shown to be about 237 times faster than CPUs on the MLPerf inference benchmark (https://blogs.nvidia.com/blog/2020/10/21/inference-mlperf-benchmarks/). For deep learning specifically, a great deal of work has gone into mature frameworks such as TensorFlow and PyTorch that allow end users to take advantage of these architectures. Beyond the GPUs themselves, the surrounding infrastructure matters: NVLink interconnects provide high-speed data transfers between GPU memories, and the NCCL library performs multi-GPU operations over these interconnects so that deep learning workloads can scale to thousands of GPUs. NCCL is also tightly integrated into the most popular deep learning frameworks.
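As a quick illustration (not part of the original benchmark), PyTorch exposes the GPU count and the NCCL backend directly, which is a convenient way to verify that a cluster can take advantage of this infrastructure:

import torch
import torch.distributed as dist

# Inspect the multi-GPU environment: number of visible CUDA devices and
# whether the NCCL backend is available for multi-GPU communication.
print("CUDA available:", torch.cuda.is_available())
print("GPUs visible:  ", torch.cuda.device_count())
print("NCCL available:", dist.is_nccl_available())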

While GPUs are almost indispensable for deep learning, the cost per hour associated with them tends to deter customers. However, with the help of the benchmarks used in this article, we hope to illustrate two key points:

  • Cost-of-solution - While the cost-per-hour of a GPU instance might be higher, the total cost-of-solution might, in fact, be lower.
  • Time-to-insight - With GPUs being faster, the time-to-insight is usually much lower due to the iterative nature of deep learning and data science. This, in turn, can result in lower infrastructure costs such as the cost of storage.

The benchmark

In this study, GPUs are used to perform inference for an NLP task, or more specifically sentiment analysis over a set of text documents. The benchmark consists of inference performed on three datasets:

  1. A small set of 3 JSON files
  2. A larger Parquet file
  3. The larger Parquet file partitioned into 10 files

The goal here is to assess the total runtimes of the inference tasks while varying the batch size to account for differences in the available GPU memory. GPU memory utilization is also monitored to account for runtime disparities. The key to obtaining the most performance from GPUs is to ensure that all the GPU compute units and memory are kept sufficiently busy at all times.
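There are several ways to monitor this; a minimal sketch using PyTorch's built-in memory counters is shown below (the cluster metrics UI or nvidia-smi work just as well, and nothing here is specific to the benchmark code):

import torch

# Spot-check how much of each GPU's memory is currently allocated by PyTorch.
for i in range(torch.cuda.device_count()):
    total = torch.cuda.get_device_properties(i).total_memory
    allocated = torch.cuda.memory_allocated(i)
    print(f"GPU {i}: {allocated / total:.1%} of {total / 1e9:.1f} GB allocated")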

The cost-per-hour of each of the instances tested is listed, and we calculate the total inference cost in order to make meaningful business cost comparisons. The code used for the benchmark is provided below.

import glob
import time

import pandas as pd
import torch
from torch import nn
from torch.utils.data import Dataset, DataLoader
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL = "distilbert-base-uncased-finetuned-sst-2-english"

def get_all_files():
  # Return either the single large Parquet file or its partitioned version
  partitioned_file_list = glob.glob('/dbfs/Users/[email protected]/Peteall_partitioned/*.parquet')
  file_list = ['/dbfs/Users/[email protected]/Peteall.txt']
  if USE_ONE_FILE:
    return file_list
  else:
    return partitioned_file_list


class TextLoader(Dataset):
    # Tokenizes the 'full_text' column of a Parquet file and serves the input_ids
    def __init__(self, file=None, transform=None, target_transform=None, tokenizer=None):
        self.file = pd.read_parquet(file)
        self.file = tokenizer(list(self.file['full_text']), padding=True, truncation=True, max_length=512, return_tensors='pt')
        self.file = self.file['input_ids']
        self.transform = transform
        self.target_transform = target_transform

    def __len__(self):
        return len(self.file)

    def __getitem__(self, idx):
        return self.file[idx]

      
class SentimentModel(nn.Module):
    # Wraps the pretrained sequence-classification model and returns softmax probabilities

    def __init__(self):
        super(SentimentModel, self).__init__()
        self.fc = AutoModelForSequenceClassification.from_pretrained(MODEL)

    def forward(self, input):
        output = self.fc(input)
        pt_predictions = nn.functional.softmax(output.logits, dim=1)
        return pt_predictions
      

dev = 'cuda'
if dev == 'cpu':
  device = torch.device('cpu')
  device_staging = 'cpu:0'
else:
  device = torch.device('cuda')
  device_staging = 'cuda:0'

tokenizer = AutoTokenizer.from_pretrained(MODEL)

all_files = get_all_files()
model3 = SentimentModel()
try:
  # If the device_ids parameter is left out, DataParallel uses all available GPUs
  model3 = nn.DataParallel(model3)
  model3.to(device_staging)
except:
  torch.set_printoptions(threshold=10000)
t0 = time.time()
for file in all_files:
    data = TextLoader(file=file, tokenizer=tokenizer)
    train_dataloader = DataLoader(data, batch_size=batch_size, shuffle=False) # Shuffle must stay False so predictions line up with the input rows
    out = torch.empty(0,0)
    for ct, batch in enumerate(train_dataloader):
        input = batch.to(device_staging)
        with torch.no_grad():   # Inference only, so no gradients are needed
            output = model3(input)
        if len(out) == 0:
            out = output
        else:
            out = torch.cat((out, output), 0)

    df = pd.read_parquet(file)['full_text']
    res = out.cpu().numpy()
    df_res = pd.DataFrame({"text": df, "negative": res[:,0], "positive": res[:,1]})
print("Time executing inference ", time.time() - t0)

The infrastructure - GPUs & CPUs

The benchmarks were run on 8 GPU clusters and 2 CPU clusters. The GPU clusters consisted of K80 (Kepler), T4 (Turing) and V100 (Volta) GPUs in various configurations available on Databricks through the AWS cloud backend. The instances were chosen to cover a range of compute and memory configurations. In terms of pure throughput, the Kepler architecture is the oldest and the least powerful, while the Volta is the most powerful.

The GPUs

  1. G4dn

These instances have NVIDIA T4 GPUs (Turing) and Intel Cascade Lake CPUs. According to AWS, they are 'optimized for machine learning inference and small scale training'. The following instances were used:

Name | GPUs | Memory | Price
g4dn.xlarge | 1 | 16GB | $0.071
g4dn.12xlarge | 4 | 192GB | $0.856
g4dn.16xlarge | 1 | 256GB | $1.141

  2. P2

These instances have the K80 GPUs (Kepler) and are used for general-purpose computing.

Name | GPUs | Memory | Price
p2.xlarge | 1 | 12GB | $0.122
p2.8xlarge | 8 | 96GB | $0.976

  3. P3

P3 instances offer up to 8 NVIDIA® V100 Tensor Core GPUs on a single instance and are ideal for machine learning applications. These instances can offer up to one petaflop of mixed-precision performance per instance. The P3dn.24xlarge instance, for example, offers 4x the network bandwidth of P3.16xlarge instances and can support NCCL for distributed machine learning.

Name | GPUs | GPU Memory | Price
p3.2xlarge | 1 | 16GB | $0.415
p3.8xlarge | 4 | 64GB | $1.66
p3dn.24xlarge | 8 | 256GB | $4.233

CPU instances

C5

The C5 instances feature the Intel Xeon Platinum 8000 series processor (Skylake-SP or Cascade Lake) with clock speeds of up to 3.6 GHz. The clusters selected here have either 48 or 96 vcpus and either 96GB or 192GB of RAM. The larger memory allows us to use larger batch sizes for the inference.

Name | CPUs | CPU Memory | Price
c5.12xlarge | 48 | 96GB | $0.728
c5.24xlarge | 96 | 192GB | $1.456

Benchmarks

Test 1

The batch size is set to 40 times the total number of GPUs in order to scale the workload to the cluster. Here, we use the single large file as is, without any partitioning. Naturally, this approach will fail when the file is too big to fit on the cluster.

Instance | Small dataset (s) | Larger dataset (s) | Number of GPUs | Cost per hour | Cost of inference (small dataset) | Cost of inference (large dataset)
G4dn.x | 19.3887 | NA | 1 | $0.071 | 0.0003 | NA
G4dn.12x | 11.9705 | 857.6637 | 4 | $0.856 | 0.003 | 0.204
G4dn.16x | 20.0317 | 2134.0858 | 1 | $1.141 | 0.006 | 0.676
P2.x | 36.1057 | 3449.9012 | 1 | $0.122 | 0.001 | 0.117
P2.8x | 11.1389 | 772.0695 | 8 | $0.976 | 0.003 | 0.209
P3.2x | 10.2323 | 622.4061 | 1 | $0.415 | 0.001 | 0.072
P3.8x | 7.1598 | 308.2410 | 4 | $1.66 | 0.003 | 0.142
P3.24x | 6.7305 | 328.6602 | 8 | $4.233 | 0.008 | 0.386

As expected, the Volta GPUs perform best, followed by the Turing and Kepler architectures. The runtimes also scale with the number of GPUs, with the exception of the last two rows: the P3.8x cluster is faster than the P3.24x despite having half as many GPUs. This is because the per-GPU memory utilization is only 17% on the P3.24x, compared to 33% on the P3.8x.
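As a sanity check on how the inference-cost columns are computed (runtime converted to hours, multiplied by the hourly price), the following snippet reproduces the G4dn.12x entry for the larger dataset:

# Cost of inference = runtime in hours x price per hour
runtime_s = 857.6637       # G4dn.12x, larger dataset, Test 1
price_per_hour = 0.856     # USD
print(round(runtime_s / 3600 * price_per_hour, 3))   # -> 0.204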

Test 2

The batch size is again set to 40 times the number of GPUs in order to scale the workload for larger clusters. The larger file is now partitioned into 10 smaller files. The only difference from the previous results table is in the columns corresponding to the larger dataset.

Instance | Small dataset (s) | Larger dataset (s) | Number of GPUs | Cost per hour | Cost of inference (small) | Cost of inference (large)
G4dn.x | 19.3887 | 2349.5816 | 1 | $0.071 | 0.0003 | 0.046
G4dn.12x | 11.9705 | 979.2081 | 4 | $0.856 | 0.003 | 0.233
G4dn.16x | 20.0317 | 2043.2231 | 1 | $1.141 | 0.006 | 0.648
P2.x | 36.1057 | 3465.6696 | 1 | $0.122 | 0.001 | 0.117
P2.8x | 11.1389 | 831.7865 | 8 | $0.976 | 0.003 | 0.226
P3.2x | 10.2323 | 644.3109 | 1 | $0.415 | 0.001 | 0.074
P3.8x | 7.1598 | 350.5021 | 4 | $1.66 | 0.003 | 0.162
P3.24x | 6.7305 | 395.6856 | 8 | $4.233 | 0.008 | 0.465

Test 3

In this case, the batch size is increased to 70 times the number of GPUs, and the large file is partitioned into 10 smaller files. Here, the P3.24x cluster is faster than the P3.8x cluster because its per-GPU utilization is much higher than in the previous experiment.

Instance | Small dataset (s) | Larger dataset (s) | Number of GPUs | Cost per hour | Cost of inference (small dataset) | Cost of inference (large dataset)
G4dn.x | 18.6905 | 1702.3943 | 1 | $0.071 | 0.0004 | 0.034
G4dn.12x | 9.8503 | 697.9399 | 4 | $0.856 | 0.002 | 0.166
G4dn.16x | 19.0683 | 1783.3361 | 1 | $1.141 | 0.006 | 0.565
P2.x | 35.8419 | OOM | 1 | $0.122 | 0.001 | NA
P2.8x | 10.3589 | 716.1538 | 8 | $0.976 | 0.003 | 0.194
P3.2x | 9.6603 | 647.3808 | 1 | $0.415 | 0.001 | 0.075
P3.8x | 7.5605 | 305.8879 | 4 | $1.66 | 0.003 | 0.141
P3.24x | 6.0897 | 258.259 | 8 | $4.233 | 0.007 | 0.304

Inference on CPU-only clusters

Here we run the same inference problem, but this time using only the smaller dataset, on CPU-only clusters. The batch size is set to 100 times the number of vCPUs.
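For reference, a minimal sketch of how this configuration maps onto the benchmark code above (the names dev and batch_size come from that snippet; the vCPU count is an assumed per-instance value):

# Assumed CPU-only settings for this test; the batch size scales with the vCPU count.
dev = 'cpu'
num_vcpus = 48                 # e.g., c5.12xlarge (96 for c5.24xlarge)
batch_size = 100 * num_vcpus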

Instance | Small dataset (s) | Number of vCPUs | RAM | Cost per hour | Cost of inference
C5.12x | 42.491 | 48 | 96GB | $0.728 | $0.009
C5.24x | 40.771 | 96 | 192GB | $1.456 | $0.016

For both clusters, the runtimes are slower on the CPUs, and the cost of inference is higher than on the GPU clusters. In fact, not only is the most expensive GPU cluster in the benchmark (P3.24x) about 6x faster than both CPU clusters, but its total inference cost ($0.007) is lower than that of even the smaller CPU cluster (C5.12x, $0.009).

Conclusion

There is a general hesitation to adopt GPUs for workloads due to the premium associated with their pricing. However, in this benchmark we have been able to illustrate that replacing CPUs with GPUs can result in cost savings to the user. The time-to-insight is also greatly reduced, resulting in faster iterations and solutions, which can be critical for GTM strategies.

Check out the repository with the notebooks and the notebook runners on GitHub.
