Spotlight on Tech

AI-first data center operations drive new priorities

By
Subha Shrinivasan
Senior Vice President, Global Services Division
Rakuten Symphony
May 21, 2026
7 minute read

As Head of Rakuten Symphony’s Global Services Division, I manage our global data center (DC) assets that support the infrastructure requirements for internal business units and external customers.

Our “IaaS” business builds, operates and provisions clusters that run our home-grown products. Traditionally, operations were optimized around legacy-first principles, with monitoring of node-level metrics such as CPU health, memory, disks, and server- and network-level observability.  

We saw priorities change internally first, and then across the industry, toward energy, thermal efficiency and lossless east-west traffic. Our AI workloads, while just a fraction of our overall workloads, quickly exposed observability gaps.

Anyone intending to run a GPU-heavy data center tomorrow needs to know how to optimize operations today. The stakes are high. Once the first AI data center is live, any inefficiency in operations quickly turns into a budget shock. And because operations don’t stop at the software stack or the underlying infrastructure, they need to be approached holistically, with attention to the metrics that truly drive performance.

Let’s explore five new metrics that we are obsessing about today, even for traditional data centers, including what is changing and needs to be reconsidered in light of new data.  

1. Power Usage Effectiveness (PUE)

Power Usage Effectiveness (PUE) is the industry standard for measuring data center energy efficiency, calculated as the ratio of total facility power to IT equipment power. A PUE of 1.0 is the theoretical ideal, meaning every watt drawn by the facility reaches IT equipment. Traditional DCs operate at a PUE of 1.5–1.8.
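To make the ratio concrete, here is a minimal sketch of the calculation. The function name and the meter readings are illustrative assumptions, not figures from our facilities.

```python
def pue(total_facility_kw: float, it_equipment_kw: float) -> float:
    """Return PUE = total facility power / IT equipment power."""
    if it_equipment_kw <= 0:
        raise ValueError("IT equipment power must be positive")
    if total_facility_kw < it_equipment_kw:
        raise ValueError("total facility power cannot be below IT power")
    return total_facility_kw / it_equipment_kw


# Hypothetical readings: 1,500 kW at the utility meter, 1,000 kW at the IT load.
print(f"PUE = {pue(1500.0, 1000.0):.2f}")  # PUE = 1.50
```

In this example, half a watt of cooling, conversion and distribution overhead is spent for every watt that reaches IT equipment.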

We had been tracking PUE as a pure energy metric and cutting power consumption to reach our target of roughly 1.2. The key insight came later: PUE is also indirectly affected by server utilization and by inefficient electrical infrastructure. Metrics are often shaped by several undercurrents at once.

The shift: PUE moved from a facilities metric to a top-tier operational KPI, focusing on driving up utilization.  

Here is how we plan to do it:

  1. Drive up server utilization, which correlates with the numbers we are seeing in the chart below.
  2. Upgrade electrical infrastructure, prioritizing UPS efficiency, distribution losses and redundant conversions.
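One way to see why utilization moves PUE is a toy model in which part of the facility overhead is fixed (it is paid regardless of load) and part scales with IT power. This is a sketch under stated assumptions, not our measured data: every parameter value below is hypothetical.

```python
def modeled_pue(utilization: float,
                it_idle_kw: float = 400.0,         # assumed IT draw at 0% utilization
                it_peak_kw: float = 1000.0,        # assumed IT draw at 100% utilization
                fixed_overhead_kw: float = 200.0,  # assumed load-independent overhead
                proportional_overhead: float = 0.1 # assumed losses per kW of IT load
                ) -> float:
    """Toy model: PUE when overhead has a fixed and a load-proportional part."""
    if not 0.0 <= utilization <= 1.0:
        raise ValueError("utilization must be between 0.0 and 1.0")
    it_kw = it_idle_kw + (it_peak_kw - it_idle_kw) * utilization
    overhead_kw = fixed_overhead_kw + proportional_overhead * it_kw
    return (it_kw + overhead_kw) / it_kw


for u in (0.2, 0.4, 0.6, 0.8):
    print(f"utilization {u:.0%}: modeled PUE {modeled_pue(u):.2f}")
```

In this model, higher utilization spreads the fixed overhead across more useful IT power, which lowers PUE. Electrical upgrades correspond to shrinking the two overhead terms.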

2. Inlet temperature and thermal throttling

Inlet temperature is the air temperature right at the front of the server, where cooling air is pulled in. It matters more to IT equipment than the ambient temperature elsewhere in the hall. It is also the best indicator of the “thermal wall.”

Currently, high intake temperatures force internal fans to consume more power and eventually trigger thermal throttling.  

A throttled processor delivers less compute than paid for. We are using Hot Aisle Containment (HAC) to enclose the hot aisle with doors, roof panels and a controlled return path, so exhaust air goes straight back to the cooling units.

Fig. Inlet temperature and thermal throttling with Hot Aisle Containment (HAC)

The shift: Monitoring the air entering the rack is now a mandatory operational discipline to prevent silent performance degradation.
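As a rough illustration of what monitoring the air entering the rack can look like, the sketch below labels per-rack inlet readings against two thresholds. The rack names, the readings and both threshold defaults are assumptions for illustration. The 27 °C default follows the commonly cited upper end of the ASHRAE recommended envelope, and real limits should come from your own equipment specifications.

```python
from dataclasses import dataclass


@dataclass
class InletReading:
    rack: str        # hypothetical rack identifier
    inlet_c: float   # air temperature at the server intake, in Celsius


def classify_inlet(reading: InletReading,
                   warn_c: float = 27.0,
                   critical_c: float = 32.0) -> str:
    """Label a reading 'ok', 'warning' or 'critical' (thresholds are assumptions)."""
    if reading.inlet_c > critical_c:
        return "critical"
    if reading.inlet_c > warn_c:
        return "warning"
    return "ok"


readings = [
    InletReading("rack-a01", 24.5),
    InletReading("rack-a02", 29.0),
    InletReading("rack-b07", 33.5),
]
for r in readings:
    print(r.rack, r.inlet_c, classify_inlet(r))
```

A "warning" here is the early signal: fans are likely working harder, and throttling may follow if the trend continues.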

As a next step, the data center expansion we are planning for the next wave of demand is designed for a CER above 1.0 from the start.

We are evaluating liquid cooling technologies, including direct-to-chip and immersion cooling, because they enable server-level cooling directly at the processor, creating the potential for better ROI.

3. Allocation vs. actual utilization

We run at an allocation above 90%, but this often masks "capacity hoarding." The more important metric is actual utilization, which also directly affects PUE.  

Teams hold infrastructure they aren't using because holding capacity guarantees allocation. When we started monitoring utilization trends over allocation trends, the picture changed completely. We identified a clear gap between permanent capacity allocations and actual usage: infrastructure remained reserved even when utilization stayed below the 70% target, with only occasional random spikes in demand.
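The gap between allocation and utilization can be surfaced with a simple report. This is a minimal sketch, assuming utilization is sampled as a fraction of allocated cores. The `Cluster` type, the cluster names and the sample values are hypothetical, and the 70% default mirrors the target mentioned above.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Cluster:
    name: str
    allocated_cores: int
    utilization_samples: list[float]  # fraction of allocated cores in use, 0.0-1.0


def hoarding_report(clusters: list[Cluster], target: float = 0.70):
    """Return (name, average utilization, idle cores) for clusters below target,
    sorted so the largest pools of idle capacity come first."""
    report = []
    for c in clusters:
        avg = mean(c.utilization_samples)
        if avg < target:
            idle_cores = round(c.allocated_cores * (1 - avg))
            report.append((c.name, avg, idle_cores))
    return sorted(report, key=lambda row: row[2], reverse=True)


clusters = [
    Cluster("qa-lab", 400, [0.25, 0.25, 0.25, 0.25]),
    Cluster("analytics", 200, [0.5, 0.5]),
    Cluster("core-prod", 800, [0.75, 1.0]),
]
for name, avg, idle in hoarding_report(clusters):
    print(f"{name}: {avg:.0%} average utilization, ~{idle} idle cores")
```

Averaging over a window keeps an occasional spike from hiding a cluster that is idle most of the time.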

The shift: We turned our obsessive monitoring towards utilization, starting with an immediate rollout of our in-house capability to power down idle processors.

We now use C-state energy-saving modes to put processors to sleep during idle periods, and we categorize workloads into two pools: hardware-dependent workloads and pure software-only stacks.

For workloads in the latter category, if CPU utilization trends below a certain threshold, we lean towards reclaiming the capacity while committing to return any cluster within 24 hours. We support that commitment with in-house golden images, preconfigured IP pools, and infrastructure that can be redeployed through infrastructure-as-code.

We are transitioning from static ownership to dynamic allocation based on real-time demand, keeping every server above 60% utilization.
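The reclaim rule can be sketched as a small decision function. Everything named here is hypothetical: the `Pool` and `Workload` types, the 14-day window and the 30% threshold passed in the example are placeholders, since the actual threshold is a policy choice.

```python
from dataclasses import dataclass
from enum import Enum


class Pool(Enum):
    HARDWARE_DEPENDENT = "hardware-dependent"
    SOFTWARE_ONLY = "software-only"


@dataclass
class Workload:
    name: str
    pool: Pool
    daily_cpu_utilization: list[float]  # one value per day, 0.0-1.0, oldest first


def should_reclaim(workload: Workload, threshold: float, window_days: int = 14) -> bool:
    """Flag a software-only workload whose CPU use stayed below the threshold
    on every day of the most recent window."""
    if workload.pool is not Pool.SOFTWARE_ONLY:
        return False  # hardware-dependent workloads are never reclaimed here
    recent = workload.daily_cpu_utilization[-window_days:]
    if len(recent) < window_days:
        return False  # not enough history to call it a trend
    return max(recent) < threshold


idle_stack = Workload("report-batch", Pool.SOFTWARE_ONLY, [0.1] * 14)
radio_rig = Workload("interop-rig", Pool.HARDWARE_DEPENDENT, [0.1] * 14)
print(should_reclaim(idle_stack, threshold=0.30))  # True
print(should_reclaim(radio_rig, threshold=0.30))   # False
```

Requiring a full window of history is a deliberately conservative choice: a newly provisioned cluster is not reclaimed before it has had time to show its real demand.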

4. Intelligent radio sleep

Our labs are radio-heavy because they support the full release QA process, including interoperability validation and performance testing. However, radios are only needed for the last mile of testing. Because these lab radios do not serve commercial traffic, they can be turned off when not in use, creating another opportunity to conserve power. Leveraging our RAN Intelligent Controller (RIC), we can also simply sleep the radios during low-traffic periods.

The shift: We are moving away from “always on” as a default given high energy costs. By implementing programmatic sleep policies for radio units and reducing wattage, we can manage budgets without compromising availability.

The rollout of RIC-driven radio micro-sleeps is planned for Q3 of this year. This has already been implemented in the Rakuten Mobile network, and we are now pushing to adopt it for our in-house DC as well.
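For lab radios that are only needed while a test is booked, a programmatic sleep policy can be as simple as the sketch below. It is an illustration only: it does not model the RIC, and the `Reservation` type, the 15-minute wake-up lead time and the dates are assumptions.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Reservation:
    start: datetime  # when a test run needs the radio
    end: datetime    # when the test run releases it


def radio_should_sleep(now: datetime,
                       reservations: list[Reservation],
                       wake_lead_minutes: int = 15) -> bool:
    """Sleep unless a reservation is active or starts within the lead time."""
    lead = timedelta(minutes=wake_lead_minutes)
    for r in reservations:
        if r.start - lead <= now < r.end:
            return False  # keep the radio awake for this test window
    return True


booked = [Reservation(datetime(2026, 5, 21, 9, 0), datetime(2026, 5, 21, 12, 0))]
print(radio_should_sleep(datetime(2026, 5, 21, 8, 0), booked))   # True
print(radio_should_sleep(datetime(2026, 5, 21, 8, 50), booked))  # False
```

The lead time is what protects availability: the radio is awake again before the test that needs it begins.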

5. Network fabric: East-West traffic

Traditional DC networks are optimized for North-South traffic (users talking to servers) via hierarchical, oversubscribed designs. Modern distributed workloads—like the interoperability testing and high-density analytics we run and the microservices from radio to the core—generate massive East-West traffic (node-to-node).

The challenge is that legacy three-tier architectures (access/aggregation/core) create "tromboning" or "hairpinning," where lateral data is forced up to the core and back down, causing unpredictable latency spikes. We typically see tail-latency (P99) spikes and "incast" congestion, where multiple nodes burst data to a single target simultaneously and overrun switch buffers.
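Tail latency is easy to miss when only averages are tracked. The sketch below computes a percentile with the nearest-rank method, one of several common percentile definitions. The latency samples are made up for illustration.

```python
import math


def percentile_nearest_rank(samples: list[float], pct: int) -> float:
    """Return the pct-th percentile using the nearest-rank method."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in the range 1-100")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# Hypothetical node-to-node latencies in ms: 98 fast samples, 2 incast-style spikes.
latencies_ms = [1.0] * 98 + [50.0] * 2
mean_ms = sum(latencies_ms) / len(latencies_ms)
print(f"mean = {mean_ms:.2f} ms")
print(f"P50  = {percentile_nearest_rank(latencies_ms, 50)} ms")
print(f"P99  = {percentile_nearest_rank(latencies_ms, 99)} ms")
```

With these samples the median stays at the fast value while P99 lands on the spike, which is why synchronized distributed tasks that wait for the slowest node are governed by the tail.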

The shift: We are moving toward a non-blocking Spine-Leaf fabric.  

This ensures a consistent, low hop count (Leaf → Spine → Leaf) for every node-to-node path, providing the deterministic bandwidth required for synchronized distributed tasks. It also supports QoS guarantees, with no single switch becoming a point of failure for the entire network. The redesign has enabled us to deliver lower traffic latencies and to bring our performance testing close to live network performance.
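"Non-blocking" can be checked per leaf switch by comparing server-facing bandwidth with spine-facing bandwidth. The sketch below computes that oversubscription ratio. The port counts and speeds are hypothetical examples, not our fabric design.

```python
def oversubscription_ratio(downlink_ports: int, downlink_gbps: float,
                           uplink_ports: int, uplink_gbps: float) -> float:
    """Return server-facing bandwidth divided by spine-facing bandwidth for a leaf.

    A ratio of 1.0 means the leaf is non-blocking; above 1.0 it is oversubscribed.
    """
    downlink_total = downlink_ports * downlink_gbps
    uplink_total = uplink_ports * uplink_gbps
    if uplink_total <= 0:
        raise ValueError("uplink bandwidth must be positive")
    return downlink_total / uplink_total


# Hypothetical leaf: 48 x 25G to servers, 6 x 100G to spines.
print(oversubscription_ratio(48, 25, 6, 100))   # 2.0
# Same leaf with 12 x 100G uplinks.
print(oversubscription_ratio(48, 25, 12, 100))  # 1.0
```

In the first case, East-West bursts can exceed what the uplinks carry. In the second, every server port can send at line rate across the fabric at the same time.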

Fig. Data center East-West optimized traffic

Conclusion

Traditional DC management is reactive. An AI-ready DC must be predictive, which means monitoring what wastes and what throttles.

We are intentionally moving from reactive observability (i.e., monitoring what breaks) to efficiency-driven observability (i.e., monitoring what wastes and throttles).  

This shift toward power, thermal and utilization metrics is the difference between infrastructure that enables scale and infrastructure that constrains it. At 500+ nodes, observability has to go beyond the software stack and node-specific metrics. This involves rethinking our data centers as a single, interconnected system that must operate efficiently at all levels, from the cable to the node to the switch to the software.
