
How PCIe Tech Can Help Build Accelerators for Machine Learning

Designing an accelerator chip for machine-learning applications is no easy task. This article explains how PCIe technology can help vendors developing such chips improve their product’s attractiveness while minimizing risk, cost, and time to market.



  • How to build an ML accelerator chip using PCI Express.
  • What is the training process for a machine-learning model?
  • The emergence of complex models that need multiple accelerators.

Machine-learning (ML) solutions, especially those based on deep learning (DL), are penetrating all aspects of our personal and business lives. ML-based solutions can be found in agriculture, media and advertising, healthcare, defense, law, finance, manufacturing, and e-commerce. On a personal level, ML touches our lives when we read Google News, play music from our Spotify playlists, browse our Amazon recommendations, and speak to Alexa or Siri.


In some markets, such as the data center, the chips that run ML workloads can be discrete ML accelerator chips. Given the size of the addressable market, it's no surprise that the market for discrete ML accelerators is highly competitive. In this article, we outline how discrete ML accelerator-chip vendors can leverage PCIe technology to make their products stand out in such a hyper-competitive market.


In addition to offering the best performance per watt per dollar for the widest possible set of machine-learning use cases, several capabilities serve as table stakes in the highly competitive ML accelerator market. First, the accelerator solution must be able to attach to as many compute chips as possible from different compute-chip vendors. Choosing a widely adopted chip-to-chip interconnect protocol such as PCI Express (PCIe) as the accelerator/compute-chip interconnect ensures that the accelerator can attach to almost all available compute chips.


Application software must be able to use the accelerator with minimum software development effort and cost. By turning the accelerator into a PCIe device, well-known robust software methods for accessing and using PCIe devices can be instantly deployed.
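As a rough illustration of how uniform PCIe device access is for software, the sketch below decodes the identification fields at the start of a device's configuration space. The field offsets follow the standard PCI configuration header. The sysfs path is the one Linux commonly exposes, and the function names are hypothetical, chosen for this example only.

```python
import struct
from pathlib import Path


def parse_pci_header(config: bytes) -> dict:
    """Decode the identification fields in the first 12 bytes of PCI config space."""
    if len(config) < 12:
        raise ValueError("need at least 12 bytes of configuration space")
    # Little-endian layout: vendor ID, device ID, command, status (16 bits each),
    # then one 32-bit word holding the revision ID (low byte) and class code (upper 24 bits).
    vendor_id, device_id, _command, _status, rev_and_class = struct.unpack_from(
        "<HHHHI", config, 0
    )
    return {
        "vendor_id": vendor_id,
        "device_id": device_id,
        "revision": rev_and_class & 0xFF,
        "class_code": rev_and_class >> 8,
    }


def read_pci_header(bdf: str, sysfs_root: str = "/sys/bus/pci/devices") -> dict:
    """Read and decode the header of the device at a bus:device.function address.

    Assumes a Linux host that exposes config space as a sysfs 'config' file.
    """
    config = (Path(sysfs_root) / bdf / "config").read_bytes()
    return parse_pci_header(config)
```

On a Linux host, a call such as `read_pci_header("0000:01:00.0")` would return these fields for whatever device sits at that address. The address here is only a placeholder.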


The industry standard for device virtualization is PCIe technology-based: SR-IOV. In addition, direct assignment of PCIe device functions to VMs is widely supported and used due to the high performance offered. Hence, by implementing PCIe architecture for their accelerator, vendors can address market segments that need high-performance virtualized accelerators.
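To make the SR-IOV point concrete, here is a minimal sketch of how host software might check how many virtual functions (VFs) a physical function supports. It assumes a Linux host, where the kernel exposes the `sriov_totalvfs` and `sriov_numvfs` attributes in a device's sysfs directory. The helper names are hypothetical.

```python
from pathlib import Path
from typing import Tuple


def sriov_status(device_dir: str) -> Tuple[int, int]:
    """Return (currently enabled VFs, maximum supported VFs) for a physical function."""
    dev = Path(device_dir)
    enabled = int((dev / "sriov_numvfs").read_text().strip())
    total = int((dev / "sriov_totalvfs").read_text().strip())
    return enabled, total


def can_enable(device_dir: str, requested_vfs: int) -> bool:
    """Check whether the requested VF count fits within what the device advertises."""
    _enabled, total = sriov_status(device_dir)
    return 0 <= requested_vfs <= total
```

This sketch only reads the attributes. Changing the VF count on a real system is done by writing to `sriov_numvfs` with sufficient privileges.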


Before machine-learning models can be deployed in production, they must be trained. The training process for ML, especially deep learning, involves feeding in a huge number of training samples to the model that’s being trained.

In most cases, these samples need to be fetched or streamed from storage systems or the network. Therefore, the time to train to an acceptable level of accuracy will be affected by the bandwidth and latency properties of the link between the ML accelerator and storage systems or networking interfaces. The lower the time to train, the better the accelerator solution will be for customers.

An accelerator can potentially lower the time to train by using PCIe technology’s peer-to-peer traffic capability to stream data directly from the storage device or the network. Using the peer-to-peer capability this way improves performance by avoiding the round trip through the host compute system’s memory for the training samples.
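The benefit of skipping the host-memory round trip can be approximated with a back-of-the-envelope model. The sketch below is a deliberate simplification: it treats the staged path as two sequential copies, ignores pipelining and switch topology, and uses bandwidth and overhead figures that the caller must supply. All function and parameter names are hypothetical.

```python
def transfer_time_s(size_bytes: float, bandwidth_bytes_per_s: float,
                    hop_overhead_s: float = 0.0) -> float:
    """Time for one copy: payload time plus a fixed per-hop overhead."""
    if bandwidth_bytes_per_s <= 0:
        raise ValueError("bandwidth must be positive")
    return size_bytes / bandwidth_bytes_per_s + hop_overhead_s


def staged_via_host_s(size_bytes: float, src_to_host_bw: float,
                      host_to_acc_bw: float, hop_overhead_s: float = 0.0) -> float:
    """Source -> host memory -> accelerator, modeled as two sequential copies."""
    return (transfer_time_s(size_bytes, src_to_host_bw, hop_overhead_s)
            + transfer_time_s(size_bytes, host_to_acc_bw, hop_overhead_s))


def peer_to_peer_s(size_bytes: float, p2p_bw: float,
                   hop_overhead_s: float = 0.0) -> float:
    """Source -> accelerator directly, modeled as a single copy."""
    return transfer_time_s(size_bytes, p2p_bw, hop_overhead_s)
```

Under this model, with equal bandwidth on every hop, the staged path takes twice as long per batch as the direct path. Real systems overlap copies, so the measured gap will differ.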


The peer-to-peer capability of PCIe architecture can be useful on the inference and generation side of machine learning as well. For example, in applications like object detection for autonomous driving, a constant stream of camera output needs to be fed to the inference accelerator with the lowest possible latency. In this case, peer-to-peer capability can be used to stream camera data to the inference accelerator with minimal latency.

High-bandwidth requirements for the connections between the ML accelerator chip and compute chips, storage cards, switches, and NICs necessitate high-data-rate serial transmission. As data rates increase and the distance between the chips expands or stays the same, advanced PCB materials and/or reach extension solutions, such as retimers, are required to stay within the channel insertion loss budget.


Multiple Accelerators for Complex Models


A system with multiple accelerators becomes necessary when large models like GPT-3 are the preferred choice for the use case. In such multi-accelerator systems, the interconnect between system components needs to offer high bandwidth, be scalable, and be capable of accommodating heterogeneous nodes attached to the interconnect fabric.

PCIe technology is a strong option for the system-component interconnect because of its high bandwidth and its ability to scale by deploying switches. As mentioned previously, the ubiquity of PCIe-based devices allows the same fabric to host NICs, storage devices, and accelerators. This enables efficient peer-to-peer communication, leading to lower time to train, lower inference latency, or higher inference throughput. For multi-accelerator use cases where a low-latency, high-bandwidth inter-accelerator interconnect is desired, the accelerator vendor can take advantage of the PCIe specification's alternate protocol support to create a custom inter-accelerator interconnect.



The performance per dollar of an ML solution depends in part on its power efficiency. For example, an inference accelerator might use its link to the compute SoC only when a new inference request is passed from the compute SoC to the accelerator. For the rest of the time, the link is essentially idle. Unless the link has low-power idle states, it will consume power unnecessarily by remaining in a high-performance active state.

For maximum efficiency, it's important that the power consumption of the accelerator's links to the rest of the system scales linearly with the utilization of those links. PCIe offers link power states like L1 and L0p to modulate the power consumption of the link based on idleness and bandwidth usage.
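The effect of low-power idle states on an intermittently used link can be sketched with a simple time-weighted average. This is an illustrative model only: it ignores state entry and exit latencies and transition energy, the wattage values are inputs the caller must supply rather than figures from the PCIe specification, and the function names are hypothetical.

```python
def average_link_power_w(utilization: float, active_w: float, idle_w: float) -> float:
    """Time-weighted average power for a link that is active a fraction of the time.

    utilization: fraction of time the link is transferring data (0.0 to 1.0)
    active_w:    power drawn in the full-performance active state
    idle_w:      power drawn while idle (equal to active_w if no low-power state exists)
    """
    if not 0.0 <= utilization <= 1.0:
        raise ValueError("utilization must be between 0 and 1")
    return utilization * active_w + (1.0 - utilization) * idle_w


def idle_state_savings_w(utilization: float, active_w: float, low_power_idle_w: float) -> float:
    """Average power saved by idling in a low-power state instead of staying active."""
    always_active = average_link_power_w(utilization, active_w, active_w)
    with_idle_state = average_link_power_w(utilization, active_w, low_power_idle_w)
    return always_active - with_idle_state
```

In this model, the lower the utilization, the more the idle-state power dominates the average, which is why a mostly idle inference link benefits most from low-power states.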


PCIe and RAS

In AI data-center deployments, reliability, availability, and serviceability (RAS) features are necessary in all system components, including accelerators. In addition, to be usable in practice, such RAS features must comply with what standard OSes and platform firmware expect. The PCIe architecture offers a rich suite of OS-friendly RAS features, including advanced error reporting and the ability to hot-add and hot-remove PCIe devices. Therefore, choosing PCIe-based links as the means to connect to other system components helps the accelerator product meet the data-center market's RAS needs.

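An important advantage that the PCIe specification offers AI accelerator vendors is the ability to retarget the same solution to several different market segments. This is accomplished by leveraging two features of PCIe technology: the availability of different form factors and the ability of PCIe interfaces to operate at different link widths. Vendors can thus scale the interface bandwidth, power consumption, and form factor in proportion to the acceleration capability that each market segment requires.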

Confidentiality and integrity of the data transferred to and from the accelerator are important for most customers. The PCIe specification has recently introduced integrity and data encryption (IDE) for data transfers over PCIe links. An accelerator vendor can leverage PCIe IDE to provide end-to-end confidentiality and integrity for data.


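Because of the competitive nature of the ML accelerator market, minimizing time to market is important for accelerator vendors. Leveraging PCIe technology helps in this regard: high-quality PCIe IP is available for rapid chip design and verification. Vendors also have easy access to compliance testing services to ensure that their chips connect to all PCIe-compliant compute systems, and they can draw on a large pool of PCIe architecture experts.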


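As mentioned earlier, machine-learning (especially deep-learning) models are growing in size and complexity. To keep up with this trend in terms of compute and memory capacity, systems with multiple interconnected accelerator chips will become increasingly necessary. Chip-to-chip interconnect performance needs to scale along with compute and memory capacity to realize the true performance potential of such systems.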

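Choosing PCIe technology for the chip-to-chip interconnect helps vendors take advantage of the bandwidth increase that each new generation of PCIe technology brings to market. For example, PCIe 6.0 is projected to offer a data rate of 64 GTransfers/s, which is double the PCIe 5.0 data rate.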
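The way bandwidth scales with both generation and link width can be worked out with simple arithmetic. The sketch below estimates per-direction bandwidth from the per-lane transfer rate, the line-encoding efficiency, and the lane count. It ignores packet and protocol overhead, and for PCIe 6.0 it reports the raw rate only, since FLIT-mode overhead is not modeled here. The table and function names are hypothetical.

```python
# Per-lane transfer rate in GT/s and line-encoding efficiency for each generation.
# PCIe 6.0 is listed at its raw rate; FLIT-mode overhead is not modeled.
GENERATIONS = {
    "1.0": (2.5, 8 / 10),
    "2.0": (5.0, 8 / 10),
    "3.0": (8.0, 128 / 130),
    "4.0": (16.0, 128 / 130),
    "5.0": (32.0, 128 / 130),
    "6.0": (64.0, 1.0),
}

COMMON_LINK_WIDTHS = (1, 2, 4, 8, 16)


def link_bandwidth_gbytes_per_s(generation: str, lanes: int) -> float:
    """Approximate per-direction link bandwidth in GB/s, before protocol overhead."""
    if generation not in GENERATIONS:
        raise ValueError(f"unknown PCIe generation: {generation}")
    if lanes not in COMMON_LINK_WIDTHS:
        raise ValueError(f"unsupported link width in this sketch: x{lanes}")
    rate_gt_per_s, efficiency = GENERATIONS[generation]
    # One transfer carries one bit per lane; divide by 8 to convert bits to bytes.
    return rate_gt_per_s * efficiency * lanes / 8
```

For instance, `link_bandwidth_gbytes_per_s("5.0", 16)` gives roughly 63 GB/s per direction. A x4 link of the same generation delivers a quarter of that, which is how one design can be scaled down for a smaller market segment.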

In summary, adopting the PCIe specification will enable ML accelerator vendors to:

  • Develop market-leading accelerators with lower risk and faster time to market.
  • Have a robust pathway to keep up with workload demands for scaling of chip-to-chip interconnect bandwidth.

