In a statement, Positron AI claims that its Atlas accelerator outperforms Nvidia's H200 during inference, consuming only 33% of the power while processing 280 tokens per second per user with Llama 3.1 8B within a 2000W power envelope.
In a significant development for the AI industry, Positron AI, a U.S.-based company founded in 2023, is making waves with its Atlas accelerator. The device, designed specifically for inference tasks, is reported to outperform Nvidia's DGX H200 system in terms of power efficiency and inference performance.
According to Positron AI, the Atlas accelerator delivers approximately 280 tokens per second per user on Llama 3.1 8B with BF16 precision while consuming around 2000W of power. By contrast, Nvidia's 8-way DGX H200 system manages roughly 180 tokens per second at 5900W. In other words, Atlas draws roughly one-third of the power while delivering about 1.5 times the token throughput — by those figures, roughly 4.5 times the performance per watt — and Positron claims a roughly threefold advantage in performance per dollar.
The H200 GPU at the heart of the DGX system is a high-end, infrastructure-grade accelerator that consumes up to 700W per GPU and ships with advanced power management features. An 8-way DGX H200 server scales power usage accordingly, to around 5900W, delivering cutting-edge performance at large scale but with a significantly higher power draw.
While Nvidia's hardware emphasizes robustness, redundancy, and general-purpose AI acceleration, Positron AI's Atlas is built from the ground up solely for inference workloads, optimizing for power and cost without carrying the overhead of training or other HPC tasks.
In summary:
| Aspect | Positron AI Atlas | Nvidia DGX H200 (8-way) |
|---|---|---|
| Power consumption | ~2000W total | ~5900W total |
| Inference performance (Llama 3.1 8B, BF16) | ~280 tokens/sec/user | ~180 tokens/sec/user |
| Performance per watt | ~3–4.5x that of DGX H200 | Baseline |
| Design focus | Specialized inference accelerator | General-purpose AI GPU accelerator |
| Power management | Lower total power, purpose-built efficiency | Advanced power management with redundancy and telemetry |
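The efficiency comparison above can be checked with quick back-of-the-envelope arithmetic on the cited figures (a rough sketch only — real-world efficiency depends on batch size, precision, and workload, and the dictionary names here are illustrative):

```python
# Performance-per-watt comparison using the throughput and power
# figures reported in this article.
systems = {
    "Positron Atlas": {"tokens_per_sec": 280, "watts": 2000},
    "Nvidia DGX H200 (8-way)": {"tokens_per_sec": 180, "watts": 5900},
}

# Tokens per second per watt for each system.
eff = {name: s["tokens_per_sec"] / s["watts"] for name, s in systems.items()}
ratio = eff["Positron Atlas"] / eff["Nvidia DGX H200 (8-way)"]

for name, e in eff.items():
    print(f"{name}: {e:.3f} tokens/sec per watt")
print(f"Atlas advantage: {ratio:.1f}x")
```

By these raw numbers the advantage works out to about 4.6x tokens per second per watt; the lower ~3x figure in the table reflects Positron's own performance-per-dollar claim rather than the pure power arithmetic.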
Cloud provider Cloudflare is among the early adopters currently testing Positron AI's Atlas for AI inference — a notable endorsement at a time when the power consumption of inference-focused AI data centers is climbing rapidly with growing AI use.
Positron AI recently raised a $51.6 million funding round led by Valor Equity Partners, Atreides Management, and DFJ Growth. The company is also developing its second-generation AI inference accelerator, Asimov, expected in 2026. Asimov will carry 2 TB of memory per ASIC without using HBM, and will offer 16 Tb/s of external network bandwidth for efficient operation in rack-scale systems.
The AI industry's power demands are raising concerns, as some massive clusters used for AI model training consume as much power as entire cities. Positron AI's Atlas, with its focus on power efficiency, could play a significant role in addressing this issue. The company manufactures its ASIC hardware at TSMC's Fab 21 in Arizona and assembles the cards in the U.S., making them an almost entirely American product.