Untether AI introduces speedAI architecture
At this week’s HotChips 2022, Untether AI announced its second-generation at-memory computation architecture for AI workloads. The speedAI architecture delivers 2 PFLOPs of performance at 30 TFLOPs per watt.
It is designed to meet the neural network demands of AI deployments across a variety of markets, from financial technology, smart cities and retail to natural language processing, autonomous vehicles, and scientific applications. These demanding applications require increasing levels of accuracy to ensure safety and quality of results, said the company.
Untether AI’s second-generation speedAI architecture enhances energy efficiency, throughput, accuracy, and scalability to levels which the company claims are unmatched by any other inference offering available today.
At-memory compute is significantly more energy efficient than traditional von Neumann architectures, said the company, with more TFLOPs performed for a given power envelope.
The speedAI architecture dramatically improves upon the first generation (runAI) by delivering 30 TFLOPs per watt. This energy efficiency is a product of the second-generation at-memory compute architecture, over 1,400 optimised RISC-V processors with custom instructions, energy-efficient dataflow, and the adoption of a new FP8 datatype, which quadruples efficiency compared to runAI.
The first member of the family, the speedAI240 device, provides 2 PFLOPs of FP8 performance and 1 PFLOP of BF16 performance. This translates into industry-leading performance and efficiency on neural networks such as BERT-base, which the speedAI240 can run at over 750 queries per second per watt, 15x greater than the current state of the art from leading GPUs, said Untether AI.
Each memory bank of the speedAI architecture has 512 processing elements with direct attachment to dedicated SRAM. These processing elements support INT4, FP8, INT8, and BF16 datatypes, along with zero-detect circuitry for energy conservation and support for 2:1 structured sparsity. Arranged in eight rows of 64 processing elements, each row has its own dedicated row controller and hardwired reduce functionality to allow flexibility in programming and efficient computation of transformer network functions such as Softmax and LayerNorm. The rows are managed by two RISC-V processors with over 20 custom instructions designed for inference acceleration. The flexibility of the memory bank allows it to adapt to a variety of neural network architectures, including convolutional, transformer, and recommendation networks as well as linear algebra models.
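To see why hardwired per-row reduce circuitry matters for transformer workloads, consider reference implementations of the two functions named above. Both Softmax and LayerNorm hinge on row-wide reductions (max, sum, mean, variance) that would otherwise serialise across the 64 processing elements in a row. The sketch below is a generic textbook formulation in Python, not Untether AI's code:

```python
import math

def softmax(row):
    # Softmax needs two row-wide reductions: a maximum
    # (for numerical stability) and a sum of exponentials.
    m = max(row)                       # reduce 1: row maximum
    exps = [math.exp(x - m) for x in row]
    s = sum(exps)                      # reduce 2: row sum
    return [e / s for e in exps]

def layernorm(row, eps=1e-5):
    # LayerNorm needs a mean and a variance -- again row-wide reductions.
    mean = sum(row) / len(row)                            # reduce 1: mean
    var = sum((x - mean) ** 2 for x in row) / len(row)    # reduce 2: variance
    return [(x - mean) / math.sqrt(var + eps) for x in row]
```

With a dedicated reduce path per row, the two reduction passes in each function can be computed in hardware rather than looped over in software, which is the efficiency claim the architecture description is making.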
Two FP8 formats are claimed to provide the best mix of precision, range, and efficiency. A 4-bit-mantissa version (FP8p for “precision”) and a 3-bit-mantissa version (FP8r for “range”) were found to provide the best accuracy and throughput for inference across a variety of networks. For both convolutional networks such as ResNet-50 and transformer networks such as BERT-Base, Untether AI’s implementation of FP8 results in less than a tenth of one per cent of accuracy loss compared to using the BF16 datatype, with a fourfold increase in throughput and energy efficiency.
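The precision-versus-range trade-off between the two formats can be illustrated with a simple quantisation sketch. The article gives only the mantissa widths, so the exponent widths below (3 bits for FP8p, 4 bits for FP8r, with a standard bias and no subnormals) are illustrative assumptions, not Untether AI's actual bit layouts:

```python
import math

def quantize_fp8(x, exp_bits, man_bits):
    """Round x to the nearest value representable in a simplified
    sign/exponent/mantissa FP8 format (assumed layout, no subnormals)."""
    if x == 0.0:
        return 0.0
    bias = 2 ** (exp_bits - 1) - 1
    sign = -1.0 if x < 0 else 1.0
    # Exponent of the binade containing |x|, clamped to the format's range.
    e = math.floor(math.log2(abs(x)))
    e = max(min(e, (2 ** exp_bits - 1) - bias), 1 - bias)
    # Spacing between representable values in that binade shrinks
    # by a factor of two for each extra mantissa bit.
    step = 2.0 ** (e - man_bits)
    return sign * round(abs(x) / step) * step

# One extra mantissa bit halves the rounding step:
fp8p = quantize_fp8(0.3, exp_bits=3, man_bits=4)   # "precision" variant
fp8r = quantize_fp8(0.3, exp_bits=4, man_bits=3)   # "range" variant
```

Here the 4-bit-mantissa format lands closer to 0.3 than the 3-bit one, while the wider exponent of the range variant would cover larger magnitudes before clamping, which is the trade-off the two formats are designed around.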
The speedAI240 device is designed to scale to large models. The memory architecture is multi-levelled, with 238Mbytes of SRAM dedicated to the processing elements offering 1 Petabyte per second of memory bandwidth, four 1Mbyte scratchpads, and two 64-bit-wide ports of LPDDR5 providing up to 32Gbytes of external DRAM. Host and chip-to-chip connectivity is provided by high-speed PCI Express Gen5 interfaces.
The imAIgine software development kit provides a path to running networks at high performance, with push-button quantisation, optimisation, physical allocation, and multi-chip partitioning. The imAIgine SDK also provides an extensive visualisation toolkit, a cycle-accurate simulator, and a runtime API, and is available now.
The speedAI devices will be offered as standalone chips and on M.2 and PCI Express form factor cards. Sampling is expected to begin in the first half of 2023.