Breaking the Scale-Out Barrier: Zero-Degradation AI Inference Optimization on a 34-Accelerator Single-Server Node

Commissioned by an internationally known NeoCloud provider, I/ONX recently concluded a rigorous series of extreme concurrency and isolation benchmarks. The findings redefine the boundaries of hardware density and what is possible within a single-server footprint.
These standard benchmark profiles are designed for apples-to-apples comparisons across architectures rather than extreme kernel-level tuning.
Yes, we know 34 accelerators on a single server is not normal. The industry-standard eight-accelerator server dominates the market with little variance, but what if that paradigm is long overdue for a reset when it comes to inference and fine-tuning?
Large-scale training is bandwidth bound, but LLM inference scaling and fine-tuning deserve their own optimized ecosystem.
In this blog post, we review a recent study I/ONX was commissioned to conduct to prove out the impact of a completely re-imagined hardware and software orchestration approach in which a single server runs multiple large language models (LLMs) across dozens of accelerators. These accelerators are deployed in a headless chassis such that none of the local chassis's memory, storage, or CPU was used for the tests.
Executive Overview: Density as the Ultimate Cost-Saver
As enterprises and cloud providers push to operationalize increasingly massive Large Language Models, the industry’s default response has been "scale-out"—buying endless racks of conventional 8-GPU servers and wiring them together using fragile, expensive InfiniBand or high-speed Ethernet fabrics.
Recently, an internationally known NeoCloud provider commissioned I/ONX to prove an alternative theory: What if we could scale-up instead?
The mission was to validate whether the I/ONX orchestration platform could maintain enterprise-grade throughput and rock-solid workload isolation on a deliberately over-subscribed, single-node architecture. The test machine was configured with an unprecedented 34 AI accelerators from three different vendors on a single bare-metal host.
The results were resounding. The I/ONX system successfully served an enormous Qwen3.5-397B-A17B (FP8) model under maximum stress with virtually zero performance degradation. For executives and infrastructure leaders, this unlocks a paradigm shift in data center economics:
Massive CapEx Reduction: By consolidating workloads into ultra-dense nodes, operators can bypass the "host tax" entirely, saving up to $1.5M per inference cluster.
Operational Simplicity: Removing multi-node network fabric requirements translates to significant power, software, and human capital savings, projecting up to $2M in OpEx savings over 3 years.
Vendor Fluidity: Proven, unified orchestration of AMD, Tenstorrent, and Furiosa silicon operating simultaneously inside the exact same server.
For AI engineering teams, this eliminates multi-node distributed network headaches and consolidates infrastructure management. Below is a detailed look at how the tests were conducted and the engineering value of the 34-XPU single-server configuration.
The Technical Challenge: Beating the "Noisy Neighbor"
When you crowd 34 accelerators onto a single motherboard's PCIe fabric, conventional wisdom says you will cripple performance. Shared resources such as CPU threads, PCIe lanes, RAM bandwidth, and thermal envelopes inevitably collide. Traditional platforms struggle with "noisy neighbor" degradation, where engaging a secondary accelerator causes the primary workload's inter-token latency to balloon.
To stress-test our data-plane isolation, the NeoCloud requested a test topology far beyond standard operational boundaries. We ran standard InferenceX benchmark profiles to ensure verifiable, apples-to-apples metrics under extreme hardware saturation.
Test Environment & Topology Parameters
Target Model: Qwen/Qwen3.5-397B-A17B-FP8
Inference Engine: SGLang (v0.5.9-rocm720-mi30x)
Attention Backend: Triton (AITER compiled)
Standard InferenceX Workload:
Tensor Parallelism (TP): 8
Input Sequence Length (ISL): 8192 | Output Sequence Length (OSL): 1024
Max Concurrency & Max Running Requests (MRR): 64
Total Prompts: 640 (burst profiling)
The single bare-metal server was populated with:
16x AMD MI300X GPUs (Configured as two distinct 8-GPU 'Hives')
16x Tenstorrent Accelerators
2x Furiosa NPUs
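As a concrete illustration of these parameters, the sketch below shows how one 8-GPU hive might be launched and benchmarked with SGLang's standard serving and benchmarking entry points. It is a hedged sketch, not the exact commands used in the study: flag names can differ across SGLang releases, and per-hive device selection (for example via HIP_VISIBLE_DEVICES on ROCm) is omitted.

# Hedged sketch (not the exact study commands): one 8-GPU hive served by SGLang
# with TP=8 and the Triton attention backend, then driven with the 640-prompt,
# ISL 8192 / OSL 1024, 64-concurrency burst profile.
import subprocess

MODEL = "Qwen/Qwen3.5-397B-A17B-FP8"

server = subprocess.Popen([
    "python", "-m", "sglang.launch_server",
    "--model-path", MODEL,
    "--tp", "8",
    "--attention-backend", "triton",
    "--port", "30000",
])

# In practice, wait for the server to report healthy before benchmarking.
subprocess.run([
    "python", "-m", "sglang.bench_serving",
    "--backend", "sglang",
    "--port", "30000",
    "--dataset-name", "random",
    "--random-input-len", "8192",
    "--random-output-len", "1024",
    "--num-prompts", "640",
    "--max-concurrency", "64",
], check=True)

server.terminate()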
Benchmark Execution & Results
The tests were executed in three progressive phases, first measuring absolute peak throughput and then the impact of increasing topological saturation.
Phase 1: Single Hive Baseline
To establish the native peak throughput, we ran the test strictly on Hive 0 (8x MI300X) without any secondary workloads active on the host.
Output Token Throughput: 822.10 tok/s
Median Time per Output Token (TPOT): 74.22 ms
Conclusion: A strong baseline demonstrating the uninhibited performance of a single 8-GPU MI300X hive serving a 397B-parameter model.
Full results from the benchmark run:
============ Serving Benchmark Result (Single Hive Baseline) ============
Successful requests: 640
Benchmark duration (s): 717.59
Total input tokens: 4727544
Total generated tokens: 589927
Request throughput (req/s): 0.89
Output token throughput (tok/s): 822.10
Total Token throughput (tok/s): 7410.19
---------------Time to First Token----------------
Mean TTFT (ms): 1831.07
Median TTFT (ms): 492.33
P99 TTFT (ms): 16809.63
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 74.41
Median TPOT (ms): 74.22
P99 TPOT (ms): 103.85
---------------Inter-token Latency----------------
Mean ITL (ms): 74.64
Median ITL (ms): 44.37
P99 ITL (ms): 411.67
----------------End-to-end Latency----------------
Mean E2EL (ms): 70525.29
Median E2EL (ms): 69131.28
P99 E2EL (ms): 113614.71
==================================================
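For readers reconciling the summary metrics with the raw counters above, the throughput lines are simple ratios of the reported totals over the benchmark duration. The quick check below reproduces them from the Phase 1 numbers; tiny differences from the reported values come from the duration being shown to only two decimal places.

# Reproduce the Phase 1 throughput summary from the raw counters above.
duration_s = 717.59
input_tokens = 4_727_544
generated_tokens = 589_927
requests = 640

print(f"Request throughput:      {requests / duration_s:.2f} req/s")   # ~0.89 req/s
print(f"Output token throughput: {generated_tokens / duration_s:.1f} tok/s")   # ~822.1 tok/s
print(f"Total token throughput:  {(input_tokens + generated_tokens) / duration_s:.1f} tok/s")  # ~7410.2 tok/s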
Phase 2: Dual Hive Isolation
We then spun up Hive 1 (8x MI300X), placing it under the exact same traffic load concurrently with Hive 0. This tested horizontal isolation between workloads within the same physical environment.
Hive 0 Throughput: 813.02 tok/s
Hive 1 Throughput: 804.64 tok/s
Conclusion: Minimal to zero noisy neighbor penalty. Both hives stayed within roughly 1-2% of the single-hive baseline (813.02 and 804.64 tok/s versus 822.10 tok/s). Running two massive Qwen instances side-by-side in distinct logical hives yielded completely isolated pipelines without tripping over host CPU resources.
Full results from the benchmark run:
============ Serving Benchmark Result (Hive 0 - 8 GPUs) ============
Successful requests: 640
Benchmark duration (s): 725.60
Total input tokens: 4727544
Total generated tokens: 589927
Request throughput (req/s): 0.88
Output token throughput (tok/s): 813.02
Total Token throughput (tok/s): 7328.37
---------------Time to First Token----------------
Mean TTFT (ms): 1862.12
Median TTFT (ms): 499.86
P99 TTFT (ms): 16794.80
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 75.23
Median TPOT (ms): 74.96
P99 TPOT (ms): 106.71
---------------Inter-token Latency----------------
Mean ITL (ms): 75.44
Median ITL (ms): 44.47
P99 ITL (ms): 416.67
----------------End-to-end Latency----------------
Mean E2EL (ms): 71326.99
Median E2EL (ms): 69820.55
P99 E2EL (ms): 116628.58
==================================================
============ Serving Benchmark Result (Hive 1 - 8 GPUs) ============
Successful requests: 640
Benchmark duration (s): 733.16
Total input tokens: 4727544
Total generated tokens: 589927
Request throughput (req/s): 0.87
Output token throughput (tok/s): 804.64
Total Token throughput (tok/s): 7252.86
---------------Time to First Token----------------
Mean TTFT (ms): 1837.17
Median TTFT (ms): 502.75
P99 TTFT (ms): 17224.04
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 76.06
Median TPOT (ms): 75.36
P99 TPOT (ms): 104.74
---------------Inter-token Latency----------------
Mean ITL (ms): 76.28
Median ITL (ms): 45.30
P99 ITL (ms): 420.28
----------------End-to-end Latency----------------
Mean E2EL (ms): 72058.27
Median E2EL (ms): 70300.94
P99 E2EL (ms): 115969.41
==================================================
Phase 3: The Extreme Stress Test (34 Accelerators Active)
In the final and most aggressive scenario, we initiated maximum topological stress. While Hive 0 and Hive 1 continued to serve their 64-concurrency Qwen workloads, the I/ONX orchestrator fully engaged the remaining 16 Tenstorrent chips and 2 Furiosa NPUs with competing tasks on the same server.
Hive 0 Throughput: 805.12 tok/s
Hive 1 Throughput: 789.46 tok/s
Median Inter-Token Latency (ITL): Held steady at ~44.5 ms and ~45.5 ms, respectively.
Conclusion: Rock-solid data-plane stability. Under the maximum thermal, PCIe, and core-thread stress conceivable for a single host, Hive 0 stayed within roughly 2% of the single-hive baseline and Hive 1 within roughly 4% (805.12 and 789.46 tok/s versus 822.10 tok/s).
Full results from the benchmark run:
============ Serving Benchmark Result (Hive 0 - 8 GPUs) ============
Successful requests: 640
Benchmark duration (s): 732.72
Total input tokens: 4727544
Total generated tokens: 589927
Request throughput (req/s): 0.87
Output token throughput (tok/s): 805.12
Total Token throughput (tok/s): 7257.17
---------------Time to First Token----------------
Mean TTFT (ms): 1889.11
Median TTFT (ms): 503.85
P99 TTFT (ms): 17041.93
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 75.96
Median TPOT (ms): 74.31
P99 TPOT (ms): 111.15
---------------Inter-token Latency----------------
Mean ITL (ms): 76.21
Median ITL (ms): 44.52
P99 ITL (ms): 437.30
----------------End-to-end Latency----------------
Mean E2EL (ms): 72045.14
Median E2EL (ms): 69805.16
P99 E2EL (ms): 120829.38
==================================================
============ Serving Benchmark Result (Hive 1 - 8 GPUs) ============
Successful requests: 640
Benchmark duration (s): 747.26
Total input tokens: 4727544
Total generated tokens: 589927
Request throughput (req/s): 0.86
Output token throughput (tok/s): 789.46
Total Token throughput (tok/s): 7115.99
---------------Time to First Token----------------
Mean TTFT (ms): 1726.90
Median TTFT (ms): 531.85
P99 TTFT (ms): 15273.40
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 77.70
Median TPOT (ms): 74.51
P99 TPOT (ms): 111.08
---------------Inter-token Latency----------------
Mean ITL (ms): 78.01
Median ITL (ms): 45.50
P99 ITL (ms): 478.71
----------------End-to-end Latency----------------
Mean E2EL (ms): 73473.21
Median E2EL (ms): 70946.56
P99 E2EL (ms): 116931.40
==================================================
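To put the isolation claims in concrete terms, the short calculation below derives each hive's throughput delta relative to the Phase 1 single-hive baseline, using only the figures reported above.

# Throughput delta of each loaded-hive run vs. the Phase 1 baseline (822.10 tok/s),
# computed from the benchmark summaries above.
BASELINE_TOKS = 822.10

runs = {
    "Phase 2, Hive 0": 813.02,
    "Phase 2, Hive 1": 804.64,
    "Phase 3, Hive 0 (34 accelerators active)": 805.12,
    "Phase 3, Hive 1 (34 accelerators active)": 789.46,
}

for name, toks in runs.items():
    drop_pct = (BASELINE_TOKS - toks) / BASELINE_TOKS * 100
    print(f"{name}: {toks:.2f} tok/s ({drop_pct:.1f}% below baseline)")

Even the worst case, Hive 1 under full 34-accelerator load, lands roughly 4% below the uncontended baseline.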
The Engineering Value of a Single-Server Configuration
For infrastructure engineers and ML Ops architects, these findings suggest a radical simplification of AI deployments for inference and fine-tuning:
Skipping the Network Layer: The hardest part of scaling Large Language Models is optimizing cross-rack fabric (managing RoCEv2, InfiniBand topologies, network partition errors, and switch latency). By fitting 64 concurrent streams of a 397B parameter model into a single server, you bypass network hop latency entirely. The data never leaves the PCIe bus.
Unified Fleet Orchestration: Typically, managing AMD, Tenstorrent, and Furiosa chips requires distinct hardware silos grouped by vendor, each demanding unique deployment workflows. I/ONX handles the translation, runtime, and resource allocation inside one unified OS layer, allowing developers to target the single server natively (a purely illustrative sketch of this pooling idea follows this list).
Absolute Workload Isolation: The extreme stress tests prove that diverse silicon types can successfully co-habit the same motherboard without stepping on one another's memory or PCIe bandwidth.
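The sketch below is purely illustrative and is not the I/ONX API: it models the general idea of a single host-level inventory of mixed-vendor accelerator pools, with workloads pinned to isolated pools. All names and structures here are hypothetical.

# Purely illustrative model (hypothetical names, NOT the I/ONX API) of pooling
# the 34 mixed-vendor accelerators on one host and pinning isolated workloads.
from dataclasses import dataclass

@dataclass
class DevicePool:
    name: str              # e.g. "hive0"
    vendor: str            # "amd", "tenstorrent", or "furiosa"
    device_ids: list[int]  # host-local accelerator indices
    workload: str | None = None

pools = [
    DevicePool("hive0", "amd", list(range(0, 8))),
    DevicePool("hive1", "amd", list(range(8, 16))),
    DevicePool("tt-pool", "tenstorrent", list(range(16, 32))),
    DevicePool("npu-pool", "furiosa", [32, 33]),
]

def assign(pools: list[DevicePool], workload: str, vendor: str) -> DevicePool:
    """Pin a workload to the first idle pool of the requested vendor."""
    for pool in pools:
        if pool.vendor == vendor and pool.workload is None:
            pool.workload = workload
            return pool
    raise RuntimeError(f"no idle {vendor} pool available for {workload}")

assign(pools, "qwen3.5-397b-serving-a", "amd")            # Hive 0
assign(pools, "qwen3.5-397b-serving-b", "amd")            # Hive 1
assign(pools, "competing-secondary-tasks", "tenstorrent")
assign(pools, "competing-secondary-tasks", "furiosa")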
Ultimately, the benchmarks we delivered to this NeoCloud partner prove that the easiest way to scale inference and fine-tuning out is to scale… UP! By centralizing management and relying on the intense single-node AI infrastructure capabilities of I/ONX orchestration, engineering teams can deploy models faster, cheaper, and with far less complexity on ultra-dense, single-server nodes.
For more information, check out our press release on the I/ONX Symphony SixtyFour.