Tensordyne tapes out Napier on TSMC 3nm, claiming log math beats Nvidia token efficiency

The AI chip startup says its hardware turns multiply-accumulate into log-and-add, targeting up to 17x more tokens per watt.

By Lama Al-Rashid, Technology Correspondent, The Executives Brief

about 7 hours ago·5 min read


Executive summary

Tensordyne has taped out its first commercial accelerator, Napier, and fabrication on TSMC's 3nm process is underway. Co-founder Gilles Backhus says the company aims to beat Nvidia on performance per watt by using logarithm-based math built on the Mitchell approximation, with errors corrected in hardware.

Tensordyne just taped out its first commercial accelerator, Napier, and it’s already in fabrication on TSMC’s 3nm process. The bet is audacious in a very specific way: instead of leaning harder on brute-force multiply-accumulate like most AI accelerators, Tensordyne says it can change the economics of matrix math by using logarithms. Co-founder Gilles Backhus told The Register that the core move is making multiplication behave like an addition problem: because log(a*b) = log(a) + log(b), the chip can add two logarithms instead of multiplying two values, as long as it can efficiently convert values into log space and convert results back.

That framing matters because it plugs directly into the KPI executives actually care about. Tensordyne claims its rack systems will deliver up to 17x more tokens per watt and 13x higher throughput than Nvidia’s Blackwell systems. Whether those numbers hold in production will depend on the unglamorous stuff, like real workloads, software pipelines, and error behavior. But the underlying claim is simple and testable: reduce computational intensity, then reduce power, and the “tokens per watt” metric moves with it.

To understand why this is more than a math flex, remember the basic hardware trade. In conventional computing, addition is cheap and multiplication is expensive. Logarithms flip that, because once you’re in log space, multiplications turn into additions. The catch is that this doesn’t come for free. Converting values to logs and back requires approximation, and approximation introduces error. Tensordyne says it looked at the “easy” approach first: lookup tables (LUTs). But Backhus said the tables needed would have been too large to be practical.

So the company went with a heuristic approximation for log and antilog: the Mitchell approximation. That still isn’t accurate enough by itself for the job, according to Backhus. The company’s answer is hardware: a section-wise correction mechanism designed to deliver accuracy equivalent to FP16. In other words, Tensordyne is not just saying “we do logs.” It’s saying “we do logs, but we patch the approximation errors in hardware to get back to a target accuracy level.”
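As a rough illustration of the Mitchell approximation named above, here is a minimal Python sketch. It is not Tensordyne’s implementation: the function names are hypothetical, only positive inputs are handled, and the section-wise hardware correction is not modeled because no details of it are given.

```python
import math


def mitchell_log2(x: float) -> float:
    """Approximate log2(x) for x > 0 as k + f, where x = 2**k * (1 + f)."""
    m, e = math.frexp(x)   # x = m * 2**e with 0.5 <= m < 1
    k = e - 1              # rewrite as x = (2*m) * 2**(e-1), so 1 <= 2*m < 2
    f = 2.0 * m - 1.0      # fractional part of the mantissa, 0 <= f < 1
    return k + f           # Mitchell: treat log2(1 + f) as simply f


def mitchell_exp2(y: float) -> float:
    """Approximate 2**y as 2**k * (1 + f), with k = floor(y) and f = y - k."""
    k = math.floor(y)
    f = y - k
    return math.ldexp(1.0 + f, k)   # (1 + f) * 2**k


def mitchell_mul(a: float, b: float) -> float:
    """Multiply two positive numbers using log, add, antilog only."""
    return mitchell_exp2(mitchell_log2(a) + mitchell_log2(b))


def worst_relative_error(steps: int = 64) -> float:
    """Sweep mantissas in [1, 2) and return the largest relative error seen."""
    worst = 0.0
    for i in range(steps):
        for j in range(steps):
            a = 1.0 + i / steps
            b = 1.0 + j / steps
            err = abs(mitchell_mul(a, b) - a * b) / (a * b)
            worst = max(worst, err)
    return worst


if __name__ == "__main__":
    print(mitchell_mul(2.0, 4.0))    # powers of two are exact
    print(mitchell_mul(3.0, 3.0))    # true answer is 9; the approximation undershoots
    print(worst_relative_error())    # largest error on this sweep
```

The worst case on this sweep is about 11.1% (at 1.5 × 1.5), which illustrates why an uncorrected Mitchell multiply on its own falls short of FP16-level accuracy.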

And if you’re tracking the direction of travel in AI datacenter design, you can see why this could matter. Lower power per operation is one thing. Lower power per generated token at rack scale is the thing that changes capex and operating costs. Tensordyne describes Napier as a chip where the multiply-accumulate (MAC) unit works without actually doing multiplication in the conventional sense. That is a strong statement, but it also explains why the company’s claims focus on power efficiency rather than raw peak throughput.

The specs Tensordyne put around Napier also place it in the “high-end GPU class” conversation. Napier has a 300-watt nominal TDP, 144 GB of HBM3e split across four stacks, 4.7 TB/s memory bandwidth, and up to 2.1 petaFLOPS of dense FP8 performance. The Register notes it is roughly comparable to Nvidia’s H200 accelerators announced in 2023 while using nearly 60% less power. As always with accelerator marketing, the caveat is that peak FLOPS often don’t represent real-world performance, and the actual comparison against Nvidia or AMD’s newest gen will have to wait.
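The “nearly 60% less power” figure can be checked with simple arithmetic. The short sketch below assumes a 700-watt maximum TDP for the H200, which comes from Nvidia’s published specifications for the SXM part rather than from Tensordyne; the helper name is hypothetical.

```python
NAPIER_TDP_W = 300   # nominal TDP quoted for Napier
H200_TDP_W = 700     # assumption: Nvidia's published maximum TDP for the H200 SXM


def power_reduction(new_w: float, old_w: float) -> float:
    """Fractional power saving of new_w relative to old_w."""
    return 1.0 - new_w / old_w


saving = power_reduction(NAPIER_TDP_W, H200_TDP_W)
print(f"{saving:.1%}")   # prints 57.1%
```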

Where Tensordyne tries to differentiate even further is in how it plans to scale. Backhus says the strategy leans on scalability of the accelerators rather than individual chip performance. Each chip features roughly a terabyte of interconnect bandwidth, enabling rack-scale deployments of up to 72 accelerators per pod. The rack system, codenamed TDN72, consists of eight air-cooled compute blades. Each blade includes a single 10-core Intel Xeon-D host CPU and nine Napier accelerators, for 72 accelerators in total.

The interconnect topology is described as reminiscent of Nvidia’s GB200 NVL72 rack systems, using an all-to-all fabric. Each chip connects to six proprietary fabric switch blades developed with Juniper, located at the back of the system. The key operational difference claimed by Tensordyne is size and cooling: the TDN72 will be much smaller than Nvidia’s NVL72 and will not require liquid cooling, which should make it easier to deploy in older “brownfield” data centers. Backhus says up to four 30 kW TDN72 systems can fit into a 52U rack, totaling 608 petaFLOPS in a 120 kW footprint, or about 1.68x more dense FP8 compute per rack than Nvidia’s GB200 NVL72. The Register also flags that Nvidia’s kit supports NVFP4 acceleration while Napier is limited to FP4 weights, and it cautions against over-reading peak-compute comparisons.
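The rack-level figures can be sanity-checked with a few lines of arithmetic. The sketch below multiplies the rounded per-chip number quoted earlier (“up to 2.1 petaFLOPS”) through the blade and system counts. The 360 petaFLOPS dense FP8 figure for a GB200 NVL72 is an assumption taken from Nvidia’s published specifications, not from Tensordyne, and the function name is hypothetical.

```python
# Figures quoted in the article
NAPIER_DENSE_FP8_PFLOPS = 2.1   # per chip, rounded ("up to 2.1")
BLADES_PER_SYSTEM = 8
CHIPS_PER_BLADE = 9
SYSTEMS_PER_RACK = 4
SYSTEM_POWER_KW = 30

# Assumption: Nvidia's published dense FP8 figure for one GB200 NVL72 rack
NVL72_DENSE_FP8_PFLOPS = 360.0


def rack_summary() -> dict:
    """Derive per-rack chip count, compute, power and ratio versus NVL72."""
    chips_per_system = BLADES_PER_SYSTEM * CHIPS_PER_BLADE
    chips_per_rack = chips_per_system * SYSTEMS_PER_RACK
    rack_pflops = chips_per_rack * NAPIER_DENSE_FP8_PFLOPS
    return {
        "chips_per_system": chips_per_system,
        "chips_per_rack": chips_per_rack,
        "rack_pflops": rack_pflops,
        "rack_power_kw": SYSTEMS_PER_RACK * SYSTEM_POWER_KW,
        "ratio_vs_nvl72": rack_pflops / NVL72_DENSE_FP8_PFLOPS,
    }


if __name__ == "__main__":
    for key, value in rack_summary().items():
        print(f"{key}: {value:.2f}")
```

With the rounded 2.1 petaFLOPS input, this gives roughly 605 petaFLOPS per rack and a ratio of about 1.68x, slightly under the quoted 608 petaFLOPS. The gap is consistent with rounding in the per-chip figure, though the source does not say so.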

Now for the part that can quietly kneecap a hardware innovation: software. Tensordyne says it has worked to keep its software platform simple for customer deployment. It also points out that its prototype silicon lacked the error correction found in Napier. On that prototype, models would have needed quantization-aware training to stay accurate, which would be impractical for anyone trying to run trillion-parameter models. The company says the software has since matured to the point where its hardware compiler can convert existing models to run on the latest hardware. It’s a pattern other chip startups have used.

For inference, Tensordyne has a proprietary serving platform plus a runtime environment that Backhus says will allow customers to use their preferred inference servers, such as vLLM. PyTorch support is described as under development. Ahead of shipping, Tensordyne is also making bold inference claims: it expects upwards of 1,000 tokens a second without relying on multi-token prediction or speculative decoding.

Finally, there’s timing and competitive pressure. Tensordyne’s Napier is slated for Q2 or Q3 of 2027, so the company has a window to prove that the mathematics, the error-correction approach, and the system-level design all work together. In the meantime, Tensordyne is already drawing interest from neocloud providers like Cirrascale and BlueSky Compute. But as with AMD and others, software compatibility can decide whether a chip becomes a deployment or a curiosity. Competing systems are also coming, including Nvidia’s next-gen Vera Rubin and Vera Rubin Ultra. For executives evaluating AI infrastructure bets, the real question is not whether log math is clever. It’s whether Tensordyne can deliver the promised tokens per watt at rack scale, with software that customers can actually run, before the next competitive wave arrives.


