Marvell Battles Nvidia and Intel with Its Latest Octeon Family of DPUs

July 2, 2021

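Marvell is pushing the envelope with its Octeon 10 DPU family. The Octeon 10 is the first server processor in its class to use TSMC's 5-nm node and Arm's N2 CPU cores, giving it a competitive edge in performance per watt.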

James Morra

Related To: Electronic Design

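Marvell Technology is taking the fight to Nvidia, Intel, and other semiconductor giants with its latest generation of Octeon data processing units (DPUs), due out by the end of 2021.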

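On Monday, the Santa Clara, California-based company said the Octeon 10 family of DPUs are systems-on-chip built on TSMC's latest 5-nm node and the first server-grade processors to integrate Arm's Neoverse N2 cores. The combination delivers three times the performance of the previous-generation Octeon TX2 while burning 50% less power, and it raises data throughput to 400 Gbps.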

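The Arm CPUs are complemented by a set of hardware accelerators and other building blocks that move, process, secure, store, and manage the data traveling through vast networks of servers or cellular base stations. Marvell upgraded many of them to state-of-the-art accelerators, from a machine learning engine to a "vector packet processing" pipeline. It also improved the cryptographic accelerators in the chip to encrypt and decrypt data in real time (at more than 400 Gbps) without CPU intervention.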

“To meet and exceed the growing data processing requirements for network, storage, and security workloads, Marvell focused on significant DPU innovations across compute, hardware accelerators, and high-speed I/O,” John Sakamoto, vice president of Marvell’s infrastructure processors business unit, said in a statement. Marvell said the first Octeon DPU in the family will start shipping in the second half of 2021.

The DPU has become a battleground in the data center segment in recent years. Nvidia has expanded its data center ambitions with its Bluefield family of DPUs that, like Marvell's Octeon DPUs, are used to offload networking, storage, security, and other infrastructure workloads from the CPU in the server and accelerate them, saving CPU capacity for other tasks. Nvidia plans to start supplying its future Bluefield-3 DPU in 2022.

Intel is also wrestling to win market share in the category with what it calls infrastructure processing units (IPUs), which are based on FPGAs instead of the more general-purpose chips at the heart of the Octeon 10. Intel said IPUs have been developed with and deployed by Microsoft, Baidu, and other major cloud vendors. Marvell is also up against industry rivals Xilinx and Broadcom as well as startups Fungible and Pensando.

Marvell said Octeon has become the most popular infrastructure processor in the world since it came to market a decade-and-a-half ago, with millions of chips deployed in data centers and 3G, 4G, and 5G RAN.

But it is aggressively pushing the envelope with its Octeon 10 family of DPUs. The Octeon 10 is the first server processor in its class based on TSMC's 5-nm node and also the first to feature Arm's N2 cores, giving it a competitive edge in performance-per-watt, thus reducing the cost of cooling and powering the chips. Marvell said Octeon 10 has a wide range of industry-first features for a DPU, such as its integrated machine learning engine. The chips also contain advanced IO interfaces, including PCIe Gen 5 and DDR5.

Intel, Nvidia, and other vendors are trying to convince Amazon, Google, Microsoft, and other cloud services players to attach DPUs to the millions of servers in the colossal data centers they use themselves and rent out to millions of clients over the cloud. But unlike its rivals, Marvell said that it is looking to stand out with more scalable platforms. The Octeon DPUs are not only targeted at cloud data centers but also wireless and wired networking gear such as switches, routers, secure gateways, firewalls, and 5G base stations.

Octeon 10 DPUs can also be attached to plug-and-play server networking cards, also called SmartNICs. "This is a platform that is scalable from the edge out to hyperscale cloud," Sakamoto told Electronic Design.

Under the hood, the Octeon 10 DPUs have up to 36 N2 cores clocked at up to 2.5 GHz and arranged on the same slab of silicon. The N2 is the first in a family of central processing cores for the data center based on the Armv9 architecture. Designed on the Perseus microarchitecture, the Arm cores support up to 40% more instructions per clock compared to the N1 core crammed in Amazon's Graviton2 and Ampere's Altra CPUs.

For years, Marvell used its architecture license with Arm to create the CPU cores at the heart of its Octeon family of server-grade processors from the ground up. But with the Octeon 10 generation, Marvell swapped its in-house TX2 core in favor of Arm's standard infrastructure cores, giving it the freedom to spend more of its engineering resources on the accelerators and other features instead of tangling itself up in CPU design.

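The single-threaded performance of the Octeon 10 is up to three times that of the previous generation based on the TX2. "If you compare the Octeon 10 with other DPU families, this is clear compute leadership," Sakamoto said.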

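The 64-bit cores come with 64 KB of L1 instruction cache and 64 KB of L1 data cache. Marvell also expanded the L2 cache to 1 MB per core to reduce latency across a wide range of workloads. The cores (which also support 32-bit code in hardware) can each access 2 MB of L3 cache, for a total of up to 36 MB of L2 and 72 MB of L3 cache in the flagship Octeon 10 DPU. According to Marvell, it also attached advanced hardware scheduling to the cores, cutting the latency from the CPU to the accelerators by a factor of three.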
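The cache totals quoted above follow from the per-core figures and the 36-core maximum. The short Python sketch below reproduces that arithmetic. It assumes the 2 MB L3 figure is a per-core share, which the article implies but does not state outright, and the function name is illustrative only.

```python
# Per-core cache sizes quoted in the article, in megabytes.
L2_PER_CORE_MB = 1
L3_PER_CORE_MB = 2  # assumption: the 2 MB L3 figure is a per-core share
MAX_CORES = 36      # maximum N2 core count quoted for the flagship part


def total_cache_mb(cores: int) -> dict:
    """Return the aggregate L2 and L3 capacity in MB for a given core count."""
    return {"l2": cores * L2_PER_CORE_MB, "l3": cores * L3_PER_CORE_MB}


flagship = total_cache_mb(MAX_CORES)
print(flagship)  # {'l2': 36, 'l3': 72}
```

Under the same assumption, a 24-core part would carry 24 MB of L2 and 48 MB of L3.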

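Because it is based on the Armv9-A architecture, the CPU also integrates 128-bit-wide processing pipelines to take advantage of Arm's SVE2 instruction set. SVE2 is a set of new instructions that improve the data processing and machine learning capabilities of Arm CPUs. That potentially gives Marvell an additional advantage over Nvidia's latest Bluefield-3 DPU, which is built on Cortex-A78 CPU cores based on Armv8.2.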

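The chips also include up to eight PCIe Gen 5 lanes, up to 16 50G Ethernet ports, and SerDes lanes running at up to 56G, an upgrade from the PCIe Gen 4 and DDR4 DRAM lanes in the Octeon TX2 processors.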

Surrounding the Arm CPU cores is a group of state-of-the-art accelerators and other hardware modules, including Marvell's in-house machine learning engine. According to Marvell, the module is based on a mosaic of inferencing tiles that incorporate SRAM and MACs that can run INT8 and FP16 operations. The inferencing tiles are linked to each other with a crossbar interconnect, which also attaches to shared system memory.

The machine learning engine in the Octeon 10 DPUs can be used by software to identify potential intruders traveling along the sprawling networks of servers in the cloud or corporate data centers. The AI capabilities can also be used in 4G and 5G base stations to improve beamforming, which aims signals directly at smartphones and other devices instead of broadcasting them over a wide area like a floodlight.

Marvell said that it integrated the machine learning engine "directly in the data pipeline" to reduce latency in the system to the levels required by 5G networks and high-throughput workloads in cloud data centers. The company said the tile-based architecture means that it can scale up or down depending on the end market. More tiles equal faster performance for AI chores. The machine learning engine can scale up to more than 100 trillion operations per second (TOPS), at the cost of occupying more die area and drawing more power.

Marvell said that it upgraded the packet processing pipeline in the Octeon 10 DPUs to process more than one packet at a time, reducing system latency and improving data throughput. Marvell said that the "vector packet processing" engine can intercept data traveling through the network, group the packets together in a set, and process the complete set as a vector in hardware instead of processing all the packets one by one.
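The difference between scalar and vector packet processing can be illustrated in software. The Python sketch below is only an analogy and does not describe Marvell's hardware pipeline: the `CountingHandler` stage, the one-byte TTL field, and the vector size of four are all hypothetical. It shows how grouping packets into a set means a processing stage is invoked once per set rather than once per packet, so any fixed per-invocation cost is spread across the whole set.

```python
from typing import Callable, List

Packet = bytes
Handler = Callable[[List[Packet]], List[Packet]]


class CountingHandler:
    """Hypothetical stage: decrements a one-byte TTL and counts invocations."""

    def __init__(self) -> None:
        self.calls = 0

    def __call__(self, batch: List[Packet]) -> List[Packet]:
        self.calls += 1  # one fixed per-call cost, however many packets arrive
        return [bytes([max(p[0] - 1, 0)]) + p[1:] for p in batch]


def process_scalar(packets: List[Packet], handler: Handler) -> List[Packet]:
    """Scalar style: the stage runs once for every single packet."""
    out: List[Packet] = []
    for packet in packets:
        out.extend(handler([packet]))
    return out


def process_vector(packets: List[Packet], handler: Handler,
                   vector_size: int = 4) -> List[Packet]:
    """Vector style: packets are grouped and each group is handled in one call."""
    if vector_size < 1:
        raise ValueError("vector_size must be at least 1")
    out: List[Packet] = []
    for start in range(0, len(packets), vector_size):
        out.extend(handler(packets[start:start + vector_size]))
    return out


packets = [bytes([64, i]) for i in range(10)]  # TTL byte followed by a payload byte
scalar_stage, vector_stage = CountingHandler(), CountingHandler()
same = process_scalar(packets, scalar_stage) == process_vector(packets, vector_stage)
print(same, scalar_stage.calls, vector_stage.calls)  # True 10 3
```

Both paths produce identical output; only the number of stage invocations changes, which is where a batched design saves work.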

According to Marvell, the improved pipeline is able to boost packet processing throughput at the system level by a factor of five compared to the scalar processing engines in its previous Octeon TX2 generation.

Another area of improvement is the integrated 1 terabit-per-second switch. The chips incorporate a set of 16 50 gigabit-per-second Ethernet ports that are configurable from 1G up to 100G Ethernet. By integrating the switch on the same silicon chip as the hardware accelerators and other computing modules, Marvell said its Octeon 10 DPUs cut system costs by reducing the amount of hardware used in 5G base stations and edge networking switches. Other advanced features in the switch include 256-bit MACsec and TSN.

For storage workloads, the chips natively support up to 20 million IO operations per second (IOPS) for NVM Express (NVMe).

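"Processing the massive amounts of data generated from the cloud to edge devices requires a great deal of compute," Chris Bergey, senior vice president and general manager of Arm's infrastructure business, said in a statement. "The combination of leading-edge 5-nm technology, Neoverse N2 cores, and the Octeon 10 will enable Marvell to take on complex workloads and demonstrate its strengths in DPU computing."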

Marvell said that it would also roll out development tools to make it easier for customers to deploy software on the Octeon DPUs, which can take advantage of the Arm ecosystem because they are based on the same instruction set. The tools include software stacks for networking, storage, and security as well as support for virtual machines and containers. The chips can also support a wide range of open and standard APIs.

"Our objective on the software side is to make it very easy for users to run and accelerate applications on the Octeon 10," Sakamoto added. “Our strategy is to have an open platform. Other companies have taken this approach where they want to create a closed ecosystem by offering their own proprietary toolkits and then trying to secure some vendor lock-in. We're trying to keep things open and support open frameworks."

Marvell plans to roll out a software development platform for the Octeon 10 family of DPUs by the end of 2021.

Marvell plans to roll out a wide range of different SKUs, ranging from the CN103xx, which suits wireless and wired networking gear in a wide range of energy-efficient, or even fanless, form factors, to the DPU400, which targets network infrastructure in cloud data centers. The flagship processor in the family fits in a 60W envelope, which is up to 50% more power efficient than its 80W to 120W predecessor, the CN98xx.
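The "up to 50%" figure can be sanity-checked against the quoted power envelopes. The article does not say how Marvell arrives at the number, so the Python sketch below assumes it is a plain comparison of power envelopes and ignores any change in performance. The function name is illustrative.

```python
FLAGSHIP_WATTS = 60             # Octeon 10 flagship envelope quoted in the article
PREDECESSOR_WATTS = (80, 120)   # CN98xx range quoted in the article


def power_reduction(new_watts: float, old_watts: float) -> float:
    """Fraction of power saved when moving from old_watts to new_watts."""
    return 1 - new_watts / old_watts


for old in PREDECESSOR_WATTS:
    saving = power_reduction(FLAGSHIP_WATTS, old)
    print(f"{old}W -> {FLAGSHIP_WATTS}W: {saving:.0%} less power")
# 80W -> 60W: 25% less power
# 120W -> 60W: 50% less power
```

On this reading, the 50% figure corresponds to the top of the predecessor's range, and the saving against an 80W part would be closer to 25%.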

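Marvell has also shipped millions of Octeon DPUs in 3G and 4G base stations around the world, and Tier-1 telecom infrastructure OEMs such as Nokia and Samsung Electronics have already selected Octeon DPUs for new 5G base stations. Marvell also introduced the CN106XX, which has up to 24 N2 CPU cores to run 4G and 5G base stations as well as corporate and wireless-carrier data centers while consuming 40W to 50W.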

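Marvell said the first processor in the family, the CN106XX, is now in production at TSMC and should reach early customers in the fourth quarter. Marvell plans to start shipping other SKUs in 2022.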