Scaling Up Silicon Photonic-based Accelerators: Challenges and Opportunities

Digital accelerators in the latest generation of CMOS processes support multiply and accumulate (MAC) operations at energy efficiencies spanning 10-to-100 fJ/Op. But the operating speed for such MAC operations is often limited to a few hundred MHz. Optical or optoelectronic MAC operations on today's SOI-based silicon photonic integrated circuit platforms can be realized at a speed of tens of GHz, leading to much lower latency and higher throughput. In this paper, we study the energy efficiency of integrated silicon photonic MAC circuits based on Mach-Zehnder modulators and microring resonators. We describe the bounds on energy efficiency and scaling limits for N × N optical networks with today's technology, based on the optical and electrical link budget. We also describe research directions that can overcome the current limitations.

(Dated: 09 September 2021)

I. INTRODUCTION
Vector matrix multiplication operations represent the core of artificial neural networks (ANNs) and other computing applications of hardware accelerators. ANNs are realized in digital complementary metal-oxide-semiconductor (CMOS) circuits with multiple processing elements implementing multiply and accumulate (MAC) operations, each of which calculates the product of two numbers and adds the result to an accumulator 1 . The processing elements can be arranged in a systolic architecture, where data is passed through connected processing elements in a rhythmic sequence, to perform MACs either spatially or temporally over several clock cycles 2 .
Integrated silicon photonics (SiP) circuits have been widely employed in high-speed links to move data at a rate of tens of Gb/s, where optical modulation is more efficient than electronic switching for transmitting data over significant distances 3 . Optical modulation can be realized using Mach-Zehnder modulators (MZMs) or microring modulators (MRMs) 4 . MZMs are broadband and easily support complex modulation schemes 3 . MRMs have a significantly smaller footprint and lower driver power consumption 5 . As a technology, the current generation of SiP has now matured, with high-volume shipments for datacenter transceivers from companies such as Intel and Cisco. SiP circuits comprising Mach-Zehnder interferometers (MZIs) or microring resonators (MRRs) have also been used for other applications such as high-speed optical switches and filters [6][7][8][9][10] .
SiP is also being used for computing applications [11][12][13][14][15][16] , where devices such as MZIs, MZMs, MRRs, and MRMs are used for computation in the optical analog domain. These encompass inference and training accelerators used for machine learning and neuromorphic computing applications where convolution takes 80% of the total processing time [17][18][19] . Other integrated optical configurations implemented using field-programmable photonic arrays were shown to carry out linear transformations for signal processing and control 20,21 . Linear transformation circuits are also employed in Ising machines 22 and photonic quantum computing processors 23 .
In this Perspective, we describe the advantages and challenges of implementing MACs using SiP, and comment on how to address them. The paper is organized as follows: Sections II and III describe the link budget, energy efficiency and scaling opportunities for SiP MACs implemented with MZMs and MRMs, respectively. Section IV explores possible approaches to further improve SiP MAC systems. It introduces ongoing research in the field of SiP that, when fully realized, will lead to significant changes in the field of optical computing and communication. Section V concludes the paper.

II. MZM BASED SI-PHOTONIC IMPLEMENTATION
A. System Architecture
Fig. 1 illustrates an MZM-based SiP implementation of an optical accelerator. Light from off-chip lasers is guided by a polarization-maintaining (PM) single mode fiber (SMF), coupled to the SiP chip via an edge coupler and then split into N parts. These parts are modulated by an array of N MZMs and fed into an N × N weight transformation (multiplication) matrix, $W_{N\times N}$. The optical intensities at the MAC outputs, $Y_{N\times 1}$, are thus described by the product of the weight matrix, $W_{N\times N}$, and the input vector, $V_{N\times 1}$, as:
$$Y_{N\times 1} = W_{N\times N}\, V_{N\times 1} \qquad (1)$$
In other words, the weight matrix performs a linear transformation of the input vector, $V_{N\times 1}$, and delivers N outputs, $Y_{N\times 1}$, that are routed to an array of photodetectors (PDs). These PDs are then connected to the electrical components, including TIAs, main amplifiers, and sense-amplifier-based comparators, which are either built in a separate CMOS/BiCMOS chip or monolithically integrated with the SiP devices in the same process.
Singular value decomposition (SVD) is an effective approach to represent a given matrix as a factorization of multiple matrices 24,25 . SVD decomposes a real matrix into a product of unitary matrices and a diagonal matrix. This is useful in the experimental realization of an N × N matrix topology in which the sequential product of rotation matrices represents the sequential arrangement of linear transformation units in the overall matrix grid 26 . A 2 × 2 linear transformation unit in the whole grid arrangement is practically implemented using a tunable beam splitter (TBS) as seen in Fig. 1 6,27 . A TBS comprises an MZI with a phase shifter (θ) in at least one of the arms, along with either an outer phase shifter (φ) or a tunable directional coupler 28,29 . The transfer function for a single TBS can be described using the matrix representations for ideal 50:50 beam splitters and lossless phase shifters, as given in Eq. (2). Thus, any arbitrary light redistribution can be obtained by changing θ and φ. The SVD decomposition of the weight matrix is given in Eq. (3), where D is a diagonal matrix, and $T_{m,n}$ represents the transformation matrix for a 2 × 2 node between two input terminals, m and n, within the N × N multiplication matrix 27 , as given in Eq. (4).
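The way a TBS redistributes light can be sketched numerically by cascading ideal 50:50 beam splitters and lossless phase shifters. The sketch below is illustrative only: the beam splitter convention, the ordering of the elements, and the helper names (`matmul2`, `phase`, `tbs`, `power_split`) are assumptions, and need not match the convention used in the paper's transfer function.

```python
import cmath
import math

def matmul2(a, b):
    """Multiply two 2x2 complex matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

# Ideal, lossless 50:50 beam splitter (one common convention, assumed here).
BS = [[1 / math.sqrt(2), 1j / math.sqrt(2)],
      [1j / math.sqrt(2), 1 / math.sqrt(2)]]

def phase(p):
    """Lossless phase shifter acting on the upper arm only."""
    return [[cmath.exp(1j * p), 0], [0, 1]]

def tbs(theta, phi):
    """TBS = BS * PS(theta) * BS * PS(phi); light meets the phi shifter first."""
    return matmul2(BS, matmul2(phase(theta), matmul2(BS, phase(phi))))

def power_split(theta, phi=0.0):
    """Power fractions from input 1 to output 1 and to output 2."""
    t = tbs(theta, phi)
    return abs(t[0][0]) ** 2, abs(t[1][0]) ** 2

if __name__ == "__main__":
    for theta in (0.0, math.pi / 2, math.pi):
        out1, out2 = power_split(theta)
        print(f"theta={theta:.3f}: out1={out1:.3f}, out2={out2:.3f}")
```

Under this assumed convention the inner phase θ sets the power splitting ratio, while the outer phase φ only changes the relative phase of the outputs, which is why the pair (θ, φ) suffices for an arbitrary 2 × 2 redistribution.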

B. Optical Network Link Budget
The photonic components shown in Fig. 1 are simulated in Cadence Spectre for 8 × 8 and 32 × 32 transformation matrix sizes to verify the optical link budget analysis. The optical components are modeled in Verilog-A to enable electronics-photonics co-simulation 30,31 . The models of some of the components used in this paper are derived from 31 , with some modifications to account for the laser electrical power consumption, the wall plug efficiency, $\eta_{WPE}$, link losses, etc. The optical power of the laser is set to 0 dBm to estimate the optical power at the output terminals of the network, incident on the PDs. Coupling light to the SiP chip introduces a loss in the range of 0.6-to-3 dB, depending on the coupling scheme used 32,33 . The overall coupling loss from the laser to the SiP chip is collectively estimated as 1.6 dB, considering possible optimizations in the coupling efficiency. 1.6 dB is also a realistic estimate for photonic wire bonds (PWBs), an emerging technology which involves writing three-dimensional waveguides in a photosensitive polymer. PWBs have demonstrated efficient interfacing between external sources and the silicon waveguide, with coupling losses ranging from as low as 0.4 dB up to 1.7 dB [34][35][36][37][38] .
For an input vector size of N, light passes through $\log_2 N$ splitters before modulation, resulting in a total insertion loss of $10\log_{10} N + EL_{splitter}\cdot\log_2 N$ dB, where $EL_{splitter}$ is the estimated excess loss for a single splitter and ranges from 0.01 dB to 0.5 dB [39][40][41][42][43] . The estimated excess loss for beam splitters and combiners in this study is 0.01 dB. The overall attenuation in the silicon waveguide is a function of the depth of the network. The MZM based implementation is based on the Clements arrangement, which is composed of beam splitters and phase shifters that can be programmed to implement linear transformation 27 . With an MZM representing one node in the Clements arrangement 27 , the waveguide attenuation can be approximated as $N\,\eta_{wg}\,L_{MZI}$, where $\eta_{wg}$ represents the optical intensity attenuation in the Si waveguide and $L_{MZI}$ is the length of an MZM arm, with chosen values of 3 dB/cm and 0.5 mm, respectively. The insertion loss introduced by an MZM's PN phase shifter is approximated as 1 dB/mm 41 .
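As a quick numerical check of the splitting and waveguide terms above, the following minimal sketch evaluates both expressions for the two simulated matrix sizes. The function names are hypothetical; the default values (0.01 dB excess loss, 3 dB/cm, 0.5 mm) are the ones quoted in the text.

```python
import math

def splitter_loss_db(n, el_splitter_db=0.01):
    """Loss of a 1-to-N splitter tree: 10*log10(N) + EL_splitter*log2(N), in dB."""
    return 10 * math.log10(n) + el_splitter_db * math.log2(n)

def waveguide_loss_db(n, eta_wg_db_per_cm=3.0, l_mzi_mm=0.5):
    """Waveguide attenuation N * eta_wg * L_MZI for a mesh of optical depth N."""
    return n * eta_wg_db_per_cm * (l_mzi_mm / 10.0)  # mm converted to cm

if __name__ == "__main__":
    for n in (8, 32):
        print(f"N={n}: splitting {splitter_loss_db(n):.2f} dB, "
              f"waveguide {waveguide_loss_db(n):.2f} dB")
```

For N = 8 this gives roughly 9.06 dB of splitting loss and 1.2 dB of waveguide attenuation, which illustrates that the 1-to-N split dominates these two terms at small N.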
The insertion loss through a node is dependent on the excess loss for cross and through states, EL cross and EL thru , respectively, both of which are typically <1 dB 44,45 . For simplicity, two phase shifters connected by 3 dB adiabatic directional couplers (EL DC ∼ 0.1 dB) 39 are assumed in this work for analysis.
To study the optical attenuation of the 8 × 8 MZM implementation shown in Fig. 1 and verify its functionality, all inputs except for the uppermost terminal (m = 1) are driven by a voltage of $V_\pi$, which creates a π phase-shift difference between the arms of their MZMs and nulls their outputs. As light is set to propagate through the uppermost input terminal, the optical depth is defined by the route passing through the diagonal TBS nodes with i = j.
For a rectangular mesh arrangement, the matrix optical depth is equal to N, with a total number of $N(N-1)/2$ optical crossings 26,27 . Hence, the 8 × 8 matrix implementation shown in Fig. 1 has an optical depth of 8 with 28 crossings.
The total optical link budgets are calculated based on Eq. (5), where $P_{SMF-att}$, $P_{EC-IL}$, $P_{Si-att}$, $P_{splitter-IL,EL}$, $P_{PS-IL}$, $P_{DC-IL}$ and $P_{penalty}$ represent the attenuation introduced by the SMF fiber, fiber to chip coupling loss, silicon waveguide attenuation, splitter insertion and excess loss, phase shifters' insertion loss, total adiabatic coupling insertion loss and network penalty, respectively. The network penalty takes into account further impairments due to extinction ratio, cross talk, intersymbol interference (ISI) and laser relative intensity noise (RIN), which is caused by random spontaneous emission over time 4,46,47 .
Fig. 2 shows the calculated optical power throughout N × N networks with different input vector sizes. The optical power of the laser is set to 0 dBm for ease of illustration. Besides the attenuation due to the splitting, the losses introduced by the optical components in the multiplication matrix (i.e. directional couplers and phase shifters) limit the scaling of the network, because the optical intensities reaching the outputs are highly attenuated.
Fig. 3 shows the optical intensities required at the analog front-end (AFE) to detect a signal with a resolution of $n_{i/p}$ bits. This is obtained by representing the desired output signal and current noises in terms of the received optical intensity as given in Eq. (8). The blue dotted lines represent the optical power at the matrix outputs and the corresponding bit resolution for a laser intensity of 10 dBm. It can be shown that the maximum achievable matrix size is ∼ 35 × 35 for binary networks operating at DR = 10 GS/s. We revisit this calculation in Section II D.

C. Energy Efficiency
The total electrical power dissipation of the whole network comprises the power consumed by the laser, input modulators, thermo-optic tuning of the matrix phase-shifters, and the AFE including PDs. Accordingly, for a configuration of size N × N operating at a data rate of DR, the energy efficiency (J/Op) can be calculated as given in Eq. (6), where $P_{laser}$, $P_{i/p-drivers}$, $P_{mem-interface}$, $P_{mat-tuning}$ and $P_{o/p-AFE}$ represent the electrical power dissipated due to the laser, input modulator drivers, data fetch interfacing circuits, matrix tuning and the output AFE circuits, respectively. $P_{SOA}$ represents the electrical power dissipated if a semiconductor optical amplifier (SOA) is used to recover the loss.
The factor γ refers to the energy efficiency enhancement. It can be represented as $\gamma = \rho_{opt}^{2}\,\rho_{SOA}$, where $\rho_{opt}$ represents the energy scaling due to the loss of precision factor, and will be described later in Section II D. $\rho_{SOA}$ represents the efficiency enhancement due to an SOA. Assuming an SOA introducing a gain of $\eta_{SOA}$ (in dB), the corresponding enhancement is $\rho_{SOA} = 10^{\eta_{SOA}/10}$. The use of an SOA is discussed later in Section IV, but it can be inferred from Eq. (6) that the use of an SOA always degrades the overall energy efficiency.
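The enhancement factor can be evaluated with a few lines of code. This is a minimal sketch of the two relations stated above; the function names and the example values are assumptions chosen only for illustration.

```python
def rho_soa(soa_gain_db):
    """Convert an SOA gain in dB to the linear enhancement factor 10^(gain/10)."""
    return 10 ** (soa_gain_db / 10.0)

def gamma(rho_opt, soa_gain_db=0.0):
    """Energy efficiency enhancement: gamma = rho_opt^2 * rho_SOA."""
    return rho_opt ** 2 * rho_soa(soa_gain_db)

if __name__ == "__main__":
    # Hypothetical example: no precision trade-off (rho_opt = 1), 10 dB SOA gain.
    print(gamma(rho_opt=1.0, soa_gain_db=10.0))
```

With no SOA (0 dB gain) and full accuracy (rho_opt = 1), gamma reduces to 1, so the expression leaves the baseline energy efficiency unchanged.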
The amount of power dissipated by the laser is represented in terms of the laser's wall plug efficiency, $\eta_{WPE}$, the optical insertion losses introduced by the SMF fiber, $IL_{SMF}$, the fiber to chip coupling, $IL_{EC}$, the silicon waveguide loss, $IL_{WG}$, the input MZM loss, $IL_{i/p-MZM}$, the weight phase shifter loss, $IL_{weight-PS}$, the directional coupler loss, $IL_{DC}$, as well as the receiver's PD sensitivity, $P_{PD-opt}$, as given in Eq. (7).
The total length of the waveguide was roughly approximated as the length spanning the optical depth of the matrix only. The output sensitivity is solved based on the targeted bit resolution, n i/p as well as the total noise at the output frontend due to the photodetector shot noise, dark current I d , thermal noise and laser relative intensity noise (RIN) as given in Eq. (8), with values reported in Table I. The parameters R, R L , k and T represent the PD responsivity, output load resistance, Boltzmann's constant and the absolute temperature.
The MZM drivers consume power that scales linearly with N as $P_{mod-driver} = N \cdot DR \cdot E_{MZM-driver}$, where $E_{MZM-driver}$ represents the energy efficiency of the MZM driver. For binary resolution, the power consumed by the drivers and AFEs is extracted from recent work on PAM2, and is typically in the range of ∼ 2 pJ/b 49 .
The matrix weights are tuned using thermo-optic phase shifters (TO-PS). Doped Si heaters on SOI platform typically dissipate about ∼ 20 mW for a π-shift 6,50,51 . The efficiency of TO-PS can be improved using other heater materials such as TiN, substrate undercut to improve insulation and deep trenches to reduce thermal cross-talk 50,52 . This can be shown to significantly improve the overall energy efficiency of the network as illustrated in Fig. 5. Assuming uniformly distributed weights, the expected energy consumption of the thermo-optic phase shifter is $E_{TO-PS} = \frac{1}{P_\pi}\int_0^{P_\pi} P_{heater}\, dP_{heater} = \frac{P_\pi}{2}$, where $P_\pi$ denotes the amount of electrical power required to create a phase shift of π. Calculating for all the nodes in the Clements topology, the total average tuning power is $\frac{N(N-1)}{4} P_\pi$.
After optical processing, the optical data needs to be converted to the electrical domain to be processed, stored or reused in other networks. Efficient opto-electronic receivers, comprising a PD, TIA and main amplifiers, have been shown to have energy efficiencies of ∼ 0.4 − 2.4 pJ/b [53][54][55][56][57][58] . For an AFE operating at 10 Gb/s and realized in 40 nm CMOS technology, an energy efficiency of 0.4 pJ/b 53 is assumed for the calculation of the binary resolution AFE, which scales with a factor of N for the whole output array. Higher AFE resolutions entail the use of linear TIAs along with analog to digital converter (ADC) circuits to recover the digital data. High speed linear TIAs have shown efficiencies as low as 0.6 pJ/b 59 . The energy consumption for ADCs is extracted from the energy per conversion figure of merit (FOM). The energy consumption values used in this work for 2b, 3b and 4b are 1.7 pJ/b, 3.1 pJ/b and 5.7 pJ/b based on a FOM of 0.335 pJ/conversion for an ADC designed to operate at a sampling rate of 28 GS/s 60 .
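The average tuning power for the Clements topology can be tabulated directly from the N(N−1)/2 node count and the mean heater power of P_π/2 per node. The sketch below assumes the ∼ 20 mW per π-shift figure quoted for doped Si heaters; the function name is hypothetical.

```python
def mean_tuning_power_mw(n, p_pi_mw=20.0):
    """Average total heater power: N(N-1)/2 nodes, each drawing P_pi/2 on average."""
    nodes = n * (n - 1) // 2
    return nodes * p_pi_mw / 2.0

if __name__ == "__main__":
    for n in (8, 32):
        print(f"N={n}: {mean_tuning_power_mw(n):.0f} mW average tuning power")
```

For N = 8 this gives 280 mW and for N = 32 about 4.96 W, which makes the quadratic growth of the static tuning power with network size explicit.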
Providing high-speed serial inputs to the SiP accelerator requires FIFOs and multiplexers to interface the data transfer with DRAM, as shown in Fig. 4. For a fair comparison to digital CMOS implementations, the power dissipation for both input and output interfacing circuits, represented by $P_{mem-interface}$, is taken into consideration in the energy efficiency calculation of SiP implementations. The power dissipated by the FIFO, multiplexers, clock dividers and retimers is estimated as 5.77 mW in 28 nm CMOS based on the data reported in 61 in 180 nm CMOS technology.

D. Efficiency Tradeoff Factors
It can be inferred from Eq. (7) that the required input laser power increases as a function of N as a result of the exponentially increasing optical losses in the MZM based accelerator. But the total energy efficiency, Eq. (6), starts improving as N scales up due to the quadratic increase in the number of accelerated operations performed by the optical matrix, as shown in Fig. 5. Taking all optical losses into account shows that there is a scaling limit beyond which optical losses grow significantly and the overall efficiency drops, so an optimal network size exists at which the energy per operation is minimized. Unfortunately, the maximum network scale, $N_{ltd}$, is limited by the rated output optical power of the laser and the SNR required for any given signal resolution, $n_{i/p} \geq 1$b, as given in Eq. (8) and illustrated in Fig. 3.
Fig. 6 shows the total energy efficiency and scaling limit for various input resolutions considering thermo-optic phase shifters with and without insulation. The energy efficiencies in Fig. 6 are calculated for accelerators to be operated at binary and higher resolution, $n_{i/p}$ = {1, 2, 3, 4}b. Although the probability of transition reduces for multilevel signaling 5 , the requirement on the driver's linearity or segmentation also increases. Furthermore, the energy consumed in the serializing and clocking remains the same 5 . Thus, we assume similar energy efficiency for multilevel signaling as PAM2, ∼ 2 pJ/b, for MZMs 62 .
It can be concluded from Fig. 6 that opting for PDs with higher responsivities improves the energy efficiency of the network. This compensates for the optical system loss, relaxes the need to inject high optical power at the network input and improves the overall energy efficiency. Utilizing avalanche PDs (APDs) is a possible way to significantly improve the optical sensitivity 64 . Fig. 6 also suggests that taking advantage of the loss of precision, when possible, yields a minor improvement in the energy efficiency and the network scaling.
Considering a matrix that scales with N, the laser optical intensity should be typically scaled by a factor of N to account for the splitting loss in a lossless network. For mesh-like configurations similar to Fig. 1, scaling the input vector size increases the dynamic range of the output intensities. In other words, for a given matrix output, the intensity can be as low as that of a single input or as high as N times that amount. With the input's digital resolution being n i/p , the effective overall output resolution due to the network scaling is n i/p + log 2 N.
The conservative estimate of scaling the input power by N may not be necessary in some computational contexts, such as convolutional neural network (CNN) layers with adaptable hidden layer resolutions 65 ; the increased output resolution might be higher than that needed by the AFE to detect. Therefore, an energy scaling vs. loss of precision trade-off factor, $\rho_{opt}$, can be introduced to take advantage of the network scaling 66,67 , as illustrated in Fig. 7. Full accuracy is described by $\rho_{opt} = 1$, corresponding to a reduced output precision of $\log_2(\rho_{opt}) = 0$, at which the input optical intensity is scaled by N. Generally, for a $\log_2(\rho_{opt})$ bit reduction, the input is scaled by $N/\rho_{opt}$. Therefore, the maximum amount of energy saving is achieved when the $\log_2 N$ bit reduction is tolerable at the optical output (AFE input).
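The relationship between network size, effective output resolution and the laser scaling factor can be made concrete with a short sketch. The function names and the choice of N = 32 are assumptions for illustration; the formulas are the ones stated in the text (output resolution of n_i/p + log2 N, input scaled by N/ρ_opt).

```python
import math

def output_resolution_bits(n_ip_bits, n):
    """Effective output resolution for an N-input mesh: n_i/p + log2(N)."""
    return n_ip_bits + math.log2(n)

def laser_scale_factor(n, bits_dropped):
    """Input power scaling N / rho_opt, where rho_opt = 2^(bits dropped at output)."""
    rho_opt = 2 ** bits_dropped
    return n / rho_opt

if __name__ == "__main__":
    n = 32  # hypothetical network size
    print(output_resolution_bits(1, n))              # full output resolution
    for dropped in range(0, int(math.log2(n)) + 1):
        print(dropped, laser_scale_factor(n, dropped))
```

Dropping all log2 N extra bits brings the scaling factor down to 1, which corresponds to the maximum energy saving case described above.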
To get a meaningful sense of the trade-off between the energy scaling and the loss of precision, $\rho_{opt}$ is quantified in Eq. (9) in terms of the probability of bit errors at the output for binary networks (networks with binary weights), where the Q function is defined as $Q(x) = \frac{1}{\sqrt{2\pi}}\int_x^{\infty} e^{-u^2/2}\,du$, and $i_{irn}$ represents the total input referred noise at the AFE input with contributions from the PD, TIA, main amplifiers and comparators (if applicable). Therefore, the laser power, $P_{laser} = N P_{opt-o/p}$, can be traded off for loss of output resolution.
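The Q function used in the bit error probability has no closed form, but it can be evaluated through the complementary error function using the standard identity Q(x) = erfc(x/√2)/2. The sketch below shows only this evaluation, not the paper's Eq. (9); the function name is an assumption.

```python
import math

def q_function(x):
    """Gaussian tail probability Q(x) = 0.5 * erfc(x / sqrt(2))."""
    return 0.5 * math.erfc(x / math.sqrt(2))

if __name__ == "__main__":
    for x in (0.0, 3.0, 7.034):
        print(f"Q({x}) = {q_function(x):.3e}")
```

An argument of about 7 corresponds to a bit error probability near 1e-12, the level commonly targeted in optical links, so small changes in the signal-to-noise ratio at the AFE input translate into large changes in the error rate.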
The IL difference between the bar and cross states of a tunable beam splitter impacts the interference between the nodes in the mesh. For an MZM with intensity loss of α 1 in one arm and α 2 in the other, it can be shown that the output intensity at the cross state is given by Eq. (10), where $I_{in1}$ and $I_{in2}$ represent the MZM input intensities at its input ports 1 and 2, respectively. To get the transmission response of an MZM with equal losses, α 1 on both arms, θ should be modified such that cos(θ ) = ∆α+2α 1 cos(θ 1 ) in order to account for the loss difference between the two MZM arms, where ∆α = α 1 − α 2 and θ 1 describes the phase shift when both arms have attenuation of α 1 . This difference in insertion losses can be observed in single-arm beam splitters, in which the phase shift is applied on a single arm only. Using dual-arm tunable beam splitters, where θ is implemented differentially using phase shifters on both arms, introduces equal insertion losses for the bar and cross transmissions of each node. A dummy phase shifter can also be used in single-arm topologies to obtain equal losses.
For the sake of comparison, the energy consumption for a digital MAC is estimated based on the energy consumed by multiplication and accumulation operations as well as register file access in a 28 nm CMOS implementation 68 . With an estimated energy consumption of 0.046 pJ for an 8b MAC and 0.0117 pJ for register file access, the calculated energy consumption is ∼ (0.046 pJ + 0.0117 pJ)/2 = 28.85 fJ for a single operation. By comparison, it can be observed from Fig. 6 that SiP networks based on MZMs need to be scaled down to achieve higher resolutions, which further degrades their energy efficiency. Compared to their 8b digital CMOS counterpart, the energy efficiencies for SiP MZM MACs (using low-power thermo-optic phase shifters with insulation and 1.2 A/W 69 PD responsivity) at $n_{i/p}$ = 1b to 4b resolutions are 3.5× to 17.5× worse. Despite the lower energy efficiency, MZM-based MAC operations are performed at 2N× higher operating speed and lower latency (for a weight-stationary systolic array with input vector size of N × 1) than the corresponding digital CMOS implementation.
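The digital reference figure and the resulting SiP energy range follow from simple arithmetic, reproduced in the sketch below. The constant and function names are hypothetical; the numerical inputs are the values quoted in the text.

```python
E_MAC_8B_PJ = 0.046      # energy of one 8b MAC in 28 nm CMOS (from the text)
E_REGFILE_PJ = 0.0117    # energy of one register file access (from the text)

def digital_energy_per_op_fj(e_mac_pj=E_MAC_8B_PJ, e_rf_pj=E_REGFILE_PJ):
    """One MAC counts as two operations (multiply and add), hence the division by 2."""
    return (e_mac_pj + e_rf_pj) / 2.0 * 1000.0  # pJ converted to fJ

def sip_energy_per_op_fj(ratio_worse):
    """SiP energy per operation implied by a 'ratio_worse x' degradation factor."""
    return ratio_worse * digital_energy_per_op_fj()

if __name__ == "__main__":
    print(f"digital: {digital_energy_per_op_fj():.2f} fJ/Op")
    print(f"SiP MZM, 1b: {sip_energy_per_op_fj(3.5):.1f} fJ/Op")
    print(f"SiP MZM, 4b: {sip_energy_per_op_fj(17.5):.1f} fJ/Op")
```

The quoted 3.5× to 17.5× degradation therefore corresponds to roughly 101 fJ to 505 fJ per operation for the MZM-based networks.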
In addition, multiple clock cycles are needed for digital multipliers and adders to provide the MAC output in systolic arrays. Operating at ∼ 10× lower clock speeds further decreases their throughput in comparison to optical implementations. Assuming that α clock cycles are needed for a digital MAC, the latency of a digital systolic array with a 1 × N input vector size and an N × N matrix size is $2N\alpha / f_{CLK-CMOS}$, where $f_{CLK-CMOS}$ represents the clock frequency of the digital CMOS implementation. Hence, the throughput ratio of the optical to CMOS implementations is $2N\alpha f_{CLK-OPT} / f_{CLK-CMOS}$, where $f_{CLK-OPT}$ represents the clock speed at which an optical implementation operates.
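The latency and throughput expressions above can be evaluated for a specific operating point. The sketch below uses hypothetical values (N = 32, α = 1, a 10 GHz optical clock and a 1 GHz CMOS clock) that are chosen purely for illustration and are not taken from the paper.

```python
def digital_latency_s(n, alpha, f_clk_cmos_hz):
    """Latency of a digital systolic array: 2*N*alpha / f_CLK-CMOS."""
    return 2 * n * alpha / f_clk_cmos_hz

def throughput_ratio(n, alpha, f_clk_opt_hz, f_clk_cmos_hz):
    """Optical-to-CMOS throughput ratio: 2*N*alpha * f_CLK-OPT / f_CLK-CMOS."""
    return 2 * n * alpha * f_clk_opt_hz / f_clk_cmos_hz

if __name__ == "__main__":
    n, alpha = 32, 1                    # hypothetical size and cycles per MAC
    f_opt, f_cmos = 10e9, 1e9           # hypothetical clock rates in Hz
    print(digital_latency_s(n, alpha, f_cmos))
    print(throughput_ratio(n, alpha, f_opt, f_cmos))
```

Both the network size and the clock-rate ratio enter multiplicatively, so the throughput advantage of the optical implementation grows with N.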
We also investigate the energy efficiency and network size at lower data rates in Fig. 8. Intuitively, the energy efficiency degrades at lower data rates because fewer operations are performed for the same dissipated static power. On the other hand, the network size, shown in Fig. 8, can be increased by making use of the SNR improvement at lower data rates, as inferred from Eq. (8). If maximizing the optical throughput is not an overarching goal, opting for lower data rates (relative to 10 GS/s) leads to larger networks while not sacrificing much on the energy efficiency in implementations incorporating phase shifters with insulation (which do not consume much static power, as shown by the dotted curves in Fig. 8 (a)).

III. MRM BASED SI-PHOTONIC IMPLEMENTATION
A. System Architecture
The SiP MRR-based implementation of an optical accelerator is illustrated in Fig. 9. A comb CW laser source is used to provide wavelengths $\lambda_1$ through $\lambda_n$, which are coupled into the chip and then modulated by an array of N MRMs. Unlike mesh-like topologies, implementing vector matrix multiplication in the form of dot products has the advantage of maintaining equal path loss for all the outputs. The modulated input vector, $X_{i/p}(\lambda)$, is then split (broadcast) into N branches to be modulated by the weight bank arrays 70 ; each output represents the dot product of the input vector and one of the row arrays of the weight matrix. In order to achieve weights with positive and negative polarities, the thru and drop transmissions of the weight arrays are routed to balanced PDs in a push-pull configuration at the receiver. The current difference at the output, $Y_{O/p}$, can be represented as given in Eq. (11) 14 , where $E_0(\lambda)$, $W_{dt}(\lambda)$ and $R(\lambda)$ represent the amplitude of the input optical field, the difference between the weight's drop and thru intensity transmissions and the PD responsivity at a wavelength λ, respectively.

B. Optical Network Link Budget
Table III. The optical power of the laser is set to 0 dBm for ease of illustration. Similar to MZM-based implementations, the attenuation introduced by the splitting and cascading of microrings significantly degrades the optical power and limits the energy efficiency, as will be discussed in Section III C. Fig. 11 shows the optical intensities required at the AFE to detect a signal with a resolution of $n_{i/p}$ bits. This is obtained by representing the desired output signal and current noises in terms of the received optical intensity as given in Eq. (8). It can be shown that the maximum achievable matrix size is ∼ 85 × 85 for binary networks. We revisit this calculation in Section III D.

C. Energy Efficiency
The energy efficiency (J/Op) of the MRM-based implementation with size N × N operating at a data rate of DR can be calculated as given in Eq. (12). $P_{laser}$ here represents the total electrical power consumed by the optical source (either a single comb laser source or multiple sources generating all the desired input wavelengths). To better study the energy efficiency based on a targeted signal resolution, $n_{i/p}$, the power consumption of the input laser source is formulated as a function of the optical power reaching the PDs at the output AFE as given in Eq. (13), where $d_{MRR}$ represents the gap between the centers of two adjacent microrings and is dictated by the thermal crosstalk, which must be taken into account in the design. A $d_{MRR}$ of 15 µm has been shown to be sufficient to avoid thermal crosstalk in a photonic switch implementation 72 . We assume a $d_{MRR}$ of 20 µm in this work for an optimistic realization of the system with $R_{MRR}$ = 6 µm.
Since the number of operations scales quadratically with the weight vector size, N, while the energy consumption of the modulators and AFE increases linearly, as can be seen in Eq. (12), the overall energy efficiency improves with scaling as shown in Fig. 12. We assume similar MRM energy efficiency for multilevel signaling as PAM2, ∼ 0.3 pJ/b 5 , which takes into account both the contribution of the modulator driver as well as the serializers. Therefore, the energy efficiencies for 2b, 3b and 4b input MRM drivers are estimated in our calculation as ∼ 0.6 pJ, 0.9 pJ and 1.2 pJ per symbol, respectively.
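The per-symbol driver energies quoted above follow from multiplying the assumed per-bit energy by the number of bits per symbol. A minimal sketch, with a hypothetical function name, is given below.

```python
def mrm_driver_energy_per_symbol_pj(bits_per_symbol, e_per_bit_pj=0.3):
    """Per-symbol MRM driver energy, assuming the same pJ/b as PAM2 (from the text)."""
    return bits_per_symbol * e_per_bit_pj

if __name__ == "__main__":
    for bits in (1, 2, 3, 4):
        print(f"{bits}b: {mrm_driver_energy_per_symbol_pj(bits):.1f} pJ/symbol")
```

The linear dependence on the number of bits is a consequence of the constant pJ/b assumption and would change if the driver linearity requirements increased the per-bit energy at higher resolutions.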
MRMs offer smaller footprint and lower input capacitance which leads to significant reduction in their driving power. However, they are also more sensitive to fabrication mismatch and thermal drift which entails the need to use heaters for calibration across a wide spectral range. Excluding heater power, the power consumed by a closed loop controller implemented for a low-power WDM topology is ∼ 0.2 mW 73 . The average energy efficiency for state-of-the-art MRR heaters on an SOI platform is ∼ 20 mW /π 7,74,75 . The scaling of the power consumed by the heaters with O(N 2 ) degrades the overall energy efficiency of the network. Fig. 13 shows the total energy efficiency and scaling limit for various input resolutions considering thermo-optic phase shifters with and without insulation. We assume power consumption values, P heater , of 2.8 mW 75 and 40 mW 7 for phase shifters with and without insulation, respectively, to provide phase shift of one free spectral range (FSR).
Driving inputs at high speeds entails the need to multiplex data fetched from the memory as shown in Fig. 4. As per the calculations in Section II C, $P_{mem-interface}$ is taken as 5.77 mW for the energy efficiency calculation of input or output memory interfacing circuits.

D. Scaling Limitations
It can be inferred from Fig. 13 that it is feasible to implement vector matrix multiplication using MRR based networks with sizes scaling up to N = 85. For n i/p = 2b or above, the network size is within the maximum number of microrings permitted for WDM implementations due to FSR limitations and crosstalk 76 . Attempting to engineer the MRR's dimensions and coupling ratio compromises the quality factor which degrades the channel spacings. This translates to a limit in the MRR vector size of N < FSR/∆λ . Channel spacings are typically set according to the amount of acceptable crosstalk between channels (interchannel interference). As an example, for a 50 nm transmission window with channel spacings of 0.8 nm, the maximum number of channels is 62 76,77 . Thus, for n i/p = 1b, FSR may set a limitation to the overall network size.
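The FSR-imposed bound on the vector size, N < FSR/∆λ, can be checked against the example in the text with a one-line calculation. The function name is an assumption; the 50 nm window and 0.8 nm spacing are the values quoted above.

```python
import math

def max_channels(window_nm, spacing_nm):
    """Largest integer number of WDM channels that fit in the usable window."""
    return math.floor(window_nm / spacing_nm)

if __name__ == "__main__":
    print(max_channels(50.0, 0.8))  # example from the text
```

For the quoted example this yields 62 channels, below the ∼ 85 × 85 limit set by the link budget for binary networks, which is why the FSR can become the binding constraint at $n_{i/p}$ = 1b.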
Using series coupling to increase the filter order has been experimentally shown to reduce both interchannel and intrachannel crosstalk, thus maximizing the filter finesse and the channel count 77 . However, this comes at the expense of a larger footprint, lower drop port transmission and extra tuning power. To cascade several MRRs for MAC operations, it is necessary to maintain channel spacings to avoid adjacent weight-dependent cross-talk. The number of channels that can be supported by optimized MRRs with finesse of 368 and 540 are calculated to be 108 and 148, respectively 76,78,79 .
A two-point coupling scheme has been proposed to address the post-fabrication correction of MRM spectral features for large-scale MRM implementations 80 . Although it mitigates the secondary resonances of an MRM and doubles the FSR, an extra micro-heater is introduced to correct for the coupling, which increases the power consumption. Another attempt to achieve an FSR-free filter has been demonstrated using tunable couplers along with modified Vernier filters that use higher-order coupled MRRs 81 . However, this topology comes with a penalty in terms of design complexity, increased footprint and tuning power.
Introducing contra-directional coupling (CDC) in a microring combines the wavelength selectivity of the CDC with the compact feature size of the MRR, thus reaping the advantages of both and providing an FSR-free response 82 . Implementing this design technique allows the potential use of several channels in MRR-based accelerators. This comes with a tradeoff of using extra heaters in the CDC and in the region of the MRR that does not include corrugated structures.
As shown in Fig. 13, the optimum energy per operation of binary SiP networks based on MRRs (∼ 75 fJ) is obtained at N = 85 for a PD responsivity of R = 1.2 A/W. It can also be shown that reducing the power consumption of weight tuning circuits by one order of magnitude improves the energy efficiency by roughly one order of magnitude as well. Compared to their 8b digital CMOS counterpart, the energy efficiencies for SiP MRM MACs (using low-power thermo-optic phase shifters with insulation and 1.2 A/W PD responsivity) at $n_{i/p}$ = 1b to 4b resolutions are 2.6× to 13× worse. In comparison to MZM-based implementations, MRR-based implementations can have 1.8× bigger network scale and achieve 1.3× lower energy consumption per operation. Similar to MZM-based implementations, MRR MACs are performed at a $2N\alpha f_{CLK-OPT} / f_{CLK-CMOS}$ × higher throughput than their digital CMOS counterparts. Fig. 14 shows the energy efficiency and scaling at lower data rates. Similar to MZM implementations, reducing the data rate degrades the energy efficiency while allowing the network size to scale up, due to the reduced noise levels at the AFE.

IV. RESEARCH OPPORTUNITIES
As summarized in sections II and III, SiP accelerators operate at much higher speed and lower latency than their CMOS counterparts. Nevertheless, it is desirable to further increase the size of the MAC networks in SiP, especially for neural network applications, and to improve the energy efficiency. There have been several promising research efforts in the field of SiP. Classifying the existing commercial SiP technology as the first generation, we describe several emerging technologies that will make up the next generation of SiP. Fig. 16 summarizes the advancements in SiP that can be leveraged by SiP-based accelerators to reduce optical loss, improve the energy efficiency and incorporate heterogeneous integration techniques for performance improvement.

A. Optical Loss Reduction
As described in sections II and III, optical losses limit the scalability of the SiP technology. Losses must be minimized at the coupling interfaces and in the components. PWB is one way to ensure efficient coupling between the chip and the optical fiber, with an insertion loss of ∼ 1 dB and negligible variation 35 . Passive alignment to SMF optical fibers can be accomplished using V-groove arrays. Such fiber-to-chip self-alignment has been shown to have a coupling efficiency of ∼ −1.3 dB 83 . In another demonstration, coupling losses as low as ∼ 0.5 dB and ∼ 0.35 dB have also been reported for passive and active alignments, respectively 84 .
For scaling up the networks, the optical signal attenuation can be compensated by using SOAs. On-chip SOAs can be utilized to pre-amplify the input signal and can also be exploited as weight matrix elements to provide weight magnitudes > 1 85 . However, the non-linear gain-current curve entails a need for calibration.
Improving the responsivity of the AFE is yet another way to tolerate the optical losses. It relaxes the need to increase the laser power to compensate for the losses. The high multiplication gain and responsivity of APDs have been shown to improve the sensitivities of optoelectronic receiver front-ends 64 . Improving dark current and quantum efficiency by careful design of the APD geometry has been projected to improve the sensitivity of Si-Ge APD receivers up to −29 dBm at 12.5 Gb/s 86 , as compared to −18.5 dBm for Ge PIN detectors 87 . Limiting the bandwidth of the AFE and using equalization techniques 88 such as continuous time linear equalization (CTLE) and decision feedback equalization (DFE) 89 can reduce the input-referred noise of the AFE and further improve the sensitivity.
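To put the two quoted sensitivities on a linear scale, the short sketch below converts them from dBm to optical power and reports the resulting margin. It only restates the arithmetic of the dBm definition; the variable and function names are illustrative, and it assumes the two figures can be compared directly, although they may have been reported under different data rates or error-rate targets.

```python
def dbm_to_mw(p_dbm: float) -> float:
    """Convert an optical power from dBm to milliwatts."""
    return 10.0 ** (p_dbm / 10.0)


# Sensitivities quoted in the text (assumed directly comparable here).
APD_SENSITIVITY_DBM = -29.0   # projected Si-Ge APD receiver at 12.5 Gb/s
PIN_SENSITIVITY_DBM = -18.5   # Ge PIN detector

# Link-budget margin gained by the more sensitive receiver, in dB.
margin_db = PIN_SENSITIVITY_DBM - APD_SENSITIVITY_DBM

# The same margin expressed as a ratio of minimum received optical powers.
power_ratio = dbm_to_mw(PIN_SENSITIVITY_DBM) / dbm_to_mw(APD_SENSITIVITY_DBM)

print(f"APD minimum power: {dbm_to_mw(APD_SENSITIVITY_DBM) * 1e3:.2f} uW")
print(f"PIN minimum power: {dbm_to_mw(PIN_SENSITIVITY_DBM) * 1e3:.2f} uW")
print(f"Margin: {margin_db:.1f} dB (about {power_ratio:.1f}x less optical power)")
```

Under this assumption, the extra margin can be spent either on lower laser power or on additional insertion loss from a larger network.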

B. Improving energy efficiency
Commercial CW lasers suffer from low WPE in the range of ∼ 1% − 10%, which impacts the energy efficiency of the system 90,91 . Hybrid-integrated silicon photonic lasers have been shown to provide ∼ 12.2% WPE 92 .
Although introducing on-chip CW lasers mitigates the coupling losses, using them, especially in networks that rely on WDM, requires wavelength stabilization and reflection cancellation 93,94 .
Reducing the power consumption of the phase shifters in the weight matrix is a critical requirement given that their overall energy consumption scales quadratically with network size (Eq. (6) and Eq. (12)). Thermo-optic phase shifters dissipate high power given their resistive nature. Introducing trenches, undercuts and back-side substrate removal has been shown to improve the tuning efficiency of the rings by an order of magnitude, with a reported power consumption of ∼ 4 mW per FSR 75,95,96 . However, thermal isolation and substrate removal exacerbate self-heating, which must be taken into consideration while designing a CMOS controller 97 .
Several post-fabrication schemes have been investigated to correct for fabrication-induced variations. One approach patterns SiN on top of the Si waveguide to introduce field perturbations that effectively adjust the optical path length 98 . Another demonstrated technique relies on trimming using Ge ion implantation followed by laser annealing to tune the MRR resonant wavelength across the whole FSR without introducing any excess loss. Its accuracy, CMOS-compatibility, and feasibility for wafer-scale correction render it a potential technique to be utilized in optical neuromorphic implementations to reduce the tuning power 99 .
Alternatives such as nano-opto-electro-mechanical systems (NOEMS) 16,100 and liquid crystal on silicon (LCOS) 101 have the potential to reduce the tuning power overhead significantly. The dynamic energy consumption of NOEMS was reported in the range of 0.13 fJ to 0.32 fJ for digital pulse signals 100 , and is assumed as ∼ 1 fJ for our study. On the other hand, LCOS phase shifters have been shown to dissipate power as low as 2 nW 101 .
Phase change materials (PCMs) such as Ge$_2$Sb$_2$Se$_4$Te (GSST) have been demonstrated as compact phase shifters in which the optical phase shift is obtained by tuning the state of the material between amorphous and crystalline 102,103 . Being able to sustain their crystallization state in the absence of power makes them good candidates for tuning low-speed weights in SiP implementations with no static power consumption. Given their compact sizes and non-volatile nature, the efficiency of implementing them in large-scale SiP networks is investigated as shown in Fig. 15. Although not as lossy as PN phase shifters, PCMs have an IL = 0.32 dB, which is relatively high for cascaded phase shifters in a large-scale implementation 102 . This limits the network sizes for computation with several bits of resolution. The resolution of the weights can be set by adjusting the level of crystallization of a PCM cell 104 .
The pulse energy consumption for writing and erasing levels 1-7 was reported in the range of 372 pJ to 601 pJ and 562 pJ to 373 pJ 104 , respectively. Assuming the weights to be uniformly distributed, the average energy consumption, $E_{PCM}$, for setting the PCM to various weights can be calculated as in Eq. (14), where $E_A$ and $E_C$ represent the pulse energy required to write (amorphization) and erase (crystallization) the first level ($L_1$), respectively. For levels $L_i$, where $i > 1$, $\Delta E_A$ and $\Delta E_C$ represent the average amount of energy required to transition to one level higher or lower, respectively, with all levels assumed to be equally spaced for the sake of simplicity.
Assuming $E_A = 372$ pJ, $E_C = 373$ pJ, $\Delta E_A = (601 - 372)/(2^n - 2)$ pJ and $\Delta E_C = (562 - 373)/(2^n - 2)$ pJ, the estimated average energy consumption for a PCM phase shifter with n = {1, 2, 3, 4}b equals {186, 231, 165, 121} pJ. For phase shifters with zero static power dissipation, the dynamic energy consumption is divided by the number of times the weights are reused for vector matrix multiplication. This is done by introducing a weight reuse factor, $\alpha_w$, such that $P_{mat-tuning} = P_{NOEMS,PCM}/\alpha_w$, where $\alpha_w$ typically ranges between $2^6$ and $2^{18}$ in general matrix multiplications (GEMMs) 105 . A value of $\alpha_w = 4096$ is chosen for the calculation of energy efficiencies in this work. For networks where the weight reuse is low, the contribution of the dynamic energy per operation can be considerably higher for PCM than for all the other weight tuning alternatives, degrading the energy efficiency by orders of magnitude.
Fig. 15 shows the energy efficiency breakdown for both the MZM-based and MRR-based architectures for weight tuning that relies on thermo-optic phase shifters without and with insulation 75 , NOEMS, LCOS and PCM. For each implementation, energy calculations are based on the network scales that can satisfy the SNR requirements to compute with bit resolutions $n_{i/p}$ = {1, 2, 3, 4}b. The insertion loss for a 35 µm long LCOS used to realize a π phase shift is taken as 0.35 dB 101 . For MZM implementations, thermo-optic phase shifters with insulation currently seem attractive for energy efficiency and network size. For a large weight reuse factor, NOEMS-based phase shifters promise further energy reduction. For MRM implementations, similar conclusions can be drawn except that LCOS-based phase shifters also seem promising.
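The effect of the weight reuse factor can be illustrated with a minimal sketch that divides the quoted average PCM programming energies by the number of reuses. The function and variable names are hypothetical, and the result is the amortized tuning energy per weight per reuse; mapping it onto a per-operation figure depends on the full energy model of the earlier sections, which is not reproduced here.

```python
# Average PCM programming energy per weight update quoted in the text, in pJ,
# keyed by the weight resolution n in bits.
PCM_ENERGY_PJ = {1: 186.0, 2: 231.0, 3: 165.0, 4: 121.0}


def amortized_energy_fj(e_dyn_pj: float, alpha_w: int) -> float:
    """Dynamic tuning energy per weight, spread over alpha_w reuses, in fJ."""
    if alpha_w < 1:
        raise ValueError("alpha_w must be at least 1")
    return e_dyn_pj * 1e3 / alpha_w  # 1 pJ = 1e3 fJ


# Lower bound, the value used in the text, and upper bound of the quoted range.
for alpha_w in (2**6, 4096, 2**18):
    per_resolution = {
        n: round(amortized_energy_fj(e_pj, alpha_w), 2)
        for n, e_pj in PCM_ENERGY_PJ.items()
    }
    print(f"alpha_w = {alpha_w:>6}: {per_resolution} fJ")
```

With a reuse factor of 4096, the amortized PCM energy falls to a few tens of fJ per weight, whereas at the low end of the range it stays in the pJ regime, which is consistent with the remark that low weight reuse degrades the efficiency by orders of magnitude.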
For both MZM-based and MRR-based architectures, it is evident that opting for matrix weight tuning alternatives with almost zero power consumption significantly improves the total energy efficiency, with values approaching the sub-100 fJ/Op regime for both architectures. Further research is still needed to demonstrate the feasibility of these approaches in high-volume production to realize such an energy efficiency regime.

C. SOA Cascadability
SOAs are used to amplify optical signals over a given spectrum and can be implemented off-chip or using hybrid integration. Cascading SOAs has been conventionally used to restore signal levels in interconnect links and has been recently explored in deep neural network implementations 85 .
Although SOAs help compensate for the optical losses to increase the scale of the network, they contribute significantly to the energy consumption. In addition, they suffer from several downsides, including the noise produced by the optical amplification, the ripples in their gain spectrum, and amplification nonlinearity 106-108 . The major noise component is attributed to the amplified spontaneous emission (ASE) of photons towards the input and output of an SOA 108,109 . Therefore, the build-up of ASE noise along the cascade degrades the SNR of the optical signal reaching the output 106,108 .
To investigate the effect of incorporating SOAs on the scalability of the network as well as on the received signal resolution, the overall SNR is quantified in Eq. (15), with an SOA gain of $G = 17$ dB 46,110 . $B_o$ and $B_e$ stand for the optical bandwidth of the amplifier and the electrical bandwidth of the AFE, respectively. $\rho_{ASE}$ represents the ASE noise and is calculated in terms of $N_{SOA}$ and $n_{sp}$, which represent the number of SOAs used in the network and the spontaneous emission factor of the optical amplifier, respectively. The parameters given in Table I are used for calculating the SNR and $N_{SOA}$. For a target network resolution, $n_{i/p}$, the number of SOAs is calculated based on the resultant SNR, whose signal and noise power values vary with the network scale. The SNR is calculated as in Eq. (17).
As can be inferred from Fig. 2 and Fig. 10 for an SOA-less network, the SNR degrades since the optical signal intensity at the AFE, $P_{o/p}$, is attenuated with the scale of N. Incorporating an SOA helps replenish the signal intensity. SOAs can be added to the network before the insertion losses degrade the SNR below the threshold for a given resolution. However, the signal-dependent noise is amplified as well, which works against the network resolution. For calculating the resolution, the AFE is assumed to tolerate signals with a maximum intensity $P_{opt}$ as high as 10 dBm, which limits the number of SOAs. Regions with no SOAs at the right side of Fig. 17 and Fig. 18, shown in appendix A, represent regions where the desired SNR cannot be achieved, indicating an infeasible resolution for a given scale.
SOAs require introducing III-V or II-VI compound semiconductor materials to the SiP platform, which is noncompliant with the standard SiP CMOS foundry runs. Back-propagating ASE noise from the SOA emphasizes the need to employ an optical isolator for the input laser source 106 . Narrowband filters can also be used to reduce the out-of-band noise 85,106 , but these further reduce the maximum channel count, thus limiting the scaling of the network. To be used for linear analog computations, SOAs should deliver gains that are independent of the input intensities. Cross-gain modulation (XGM) is one type of non-linearity observed in SOAs in which the combination of all the input intensities impacts the gain of a single channel 111,112 .
The aforementioned SOA limitations should be addressed in order to maintain the linearity of the weight matrix and allow further scaling of networks. Designing highly efficient SOAs has the potential to increase the size of the networks. However, improving the network energy efficiency requires that the SOAs have a low injection current and a high optical signal-to-noise ratio. Recent attempts to use SOAs for optical networks have reported a power consumption of 42 mW per SOA, which leads to an energy consumption of ∼ 4.2 pJ/Op at DR = 10 GS/s 85 . To carry out 4 weighted additions, 16 SOAs were used for the weight tuning, along with extra SOAs for optical pre-amplification and input selection. A crosstalk of 0.6 dB was reported for the SOAs even for a small-scale circuit implementation of an arrayed waveguide grating (AWG) filter, which necessitated the use of feedback loops for gain calibration. For accelerators with SOA integration, the contribution to the overall network energy efficiency scales with $O(N^2)$, setting the efficiency to the pJ/Op regime.
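The quoted per-SOA figure follows from dividing the power by the data rate, as the sketch below shows. The helper names are illustrative. The second function assumes, as a simplification suggested by the 16 SOAs used for 4 weighted additions, that every element of an N × N weight matrix is realized by one SOA; pre-amplification and input-selection SOAs are ignored.

```python
def energy_per_sample_pj(power_mw: float, rate_gsps: float) -> float:
    """Energy per sample in pJ: mW / (GS/s) = 1e-3 W / 1e9 s^-1 = 1e-12 J."""
    return power_mw / rate_gsps


def weight_soa_power_mw(n: int, power_per_soa_mw: float) -> float:
    """Total power of the weight SOAs, assuming one SOA per element of an N x N matrix."""
    return n * n * power_per_soa_mw


SOA_POWER_MW = 42.0      # reported power consumption per SOA
DATA_RATE_GSPS = 10.0    # data rate DR used in the text

e_soa_pj = energy_per_sample_pj(SOA_POWER_MW, DATA_RATE_GSPS)
print(f"Energy per SOA per sample: {e_soa_pj:.1f} pJ")

# Quadratic growth of the static SOA power with the network scale N.
for n in (4, 8, 16):
    total_w = weight_soa_power_mw(n, SOA_POWER_MW) / 1e3
    print(f"N = {n:>2}: {n * n:>3} weight SOAs, {total_w:.2f} W")
```

Because the number of weight SOAs grows as N squared while each one adds a fixed energy per sample, the SOA contribution stays in the pJ regime regardless of the network scale under this assumption.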

V. CONCLUSION
We describe the behavior of MZM- and MRM-based SiP implementations for MAC accelerators based on today's SiP 1.0 technology. Both MZM and MRM implementations share similar optical and electrical challenges. In comparison to digital CMOS accelerators, SiP implementations have relatively higher energy consumption and operate at lower bit resolutions. In addition, they cannot be scaled to large network sizes because of the optical losses. Implementing MACs using SiP has two distinct advantages 113 :
2. Multiplication operations can be intrinsically implemented in parallel in which the analog nature of the computation allows all matrix operations to take place at the same time for each input fetch 114 . Therefore, optical MAC implementations can increase their throughput and improve the energy efficiency at such high speeds. MACs implemented with digital circuits in CMOS are limited by the wiring interconnect density.
There are also some challenges in implementing MACs using SiP:
1. The losses in optical circuits severely limit the size of the MAC networks that can be physically realized, in comparison to a digital CMOS implementation where signal gain and regeneration are readily available. This, in turn, limits the applications of SiP MACs.
2. Although the power consumed in the optical MAC is small, when accounting for the losses and the power consumed by the laser and CMOS electronic circuits that drive and control the optical circuits, the energy efficiency is degraded.
3. Unlike digital CMOS implementations which can support 16b/32b resolution, analog photonic MACs have a maximum demonstrated resolution of 8.5b 115 . Nevertheless, such a resolution has been shown to be adequate for many inference tasks 116-118 .
4. Achieving a high-throughput optical network entails accessing data at high speed. High-speed input/output (I/O) data can be streamed in/out from/to an off-chip DRAM; the corresponding energy consumption in moving the data must be considered for the overall implementation of an accelerator 119 . The energy consumption of data fetches from off-chip DRAM is significant. However, for a given dataset, the DRAM-associated penalty is similar for both optical and digital CMOS implementations. We exclude that penalty in our work. For weight-stationary implementations where weights do not need to change frequently, an on-chip SRAM can be used, which can be adequately large due to the limited network size of the photonic accelerators.
5. Most of the low-loss phase shifters have a reconfiguration speed in the range of µs-to-ms, making the weight reconfiguration in photonic MACs significantly slower than in their electronic counterparts. This limits the use of photonic MACs to weight-stationary systolic array implementations, where the incoming data is high-speed, but the weights are not updated quickly. For other scenarios, a fine weight retuning can be done with high-speed plasma-dispersion phase shifters, where the loss is controlled because only a fine weight tuning range is needed.
6. To carry out optoelectronic computing with several bits of resolution at high speed, CMOS or BiCMOS drivers and transimpedance amplifiers (TIAs) are needed that must operate with multi-level pulse-amplitude modulation (PAM) signaling. Although many PAM2 (1b) and PAM4 (2b) transceivers have been demonstrated 62,120 , higher levels of modulation require linear drivers and TIAs, which are challenging to design at high speed with good energy efficiency 3 . However, if data rates in optical computing are limited to a few tens of GBaud, this challenge is surmountable.
7. Packaging considerations in optics (e.g., laser, fiber and SOA attach) are far more challenging than the packaging considerations for electronic dies due to alignment accuracy and thermal management requirements 121 .
The mesh-like interconnections of MZM-based implementations, which extend the optical dynamic range at the AFE, place a tradeoff between the output resolution and the power requirement of the laser. MZM drivers also consume higher electrical driver power because MZMs cannot be made very long due to the losses associated with their larger footprints. Therefore, the power consumption of the laser and high-speed drivers is amortized over only a limited network scale. On the other hand, MRR-based topologies provide better energy efficiencies due to their small footprint and thus lower modulation energy consumption. The overall energy efficiency of either implementation experiences major degradation, mainly due to the inefficiency of lasers and phase shifters, the insertion and excess losses of the optical components, as well as the optical-to-electrical and electrical-to-optical conversion overhead.
However, an order of magnitude higher operating speed as well as the inherent parallelism in conducting multiplication for analog signals render SiP implementations attractive for reducing delay and enhancing throughput in comparison to digital CMOS implementations. With the emerging technologies in SiP, e.g. NOEMS and LCOS, low-energy tuning schemes have the potential to significantly improve the energy efficiency of the photonic accelerators. Nonetheless, thermo-optic phase shifters with insulation are still an efficient weight-tuning option that is attractive for mass production. Low voltage-swing modulators with heterogeneous integration of polymers also promise improvements in energy efficiency due to the significant reduction in modulator and CMOS driver power consumption.
Scaling SiP accelerators to larger network sizes is limited by the rated output optical power of the laser as well as the SNR required for a given signal resolution, $n_{i/p}$. MRM-based networks can scale to larger values than their MZM-based counterparts due to the lower loss associated with cascading microrings in a WDM implementation. Incorporating high-power multi-wavelength lasers will be crucial. However, MRR-based networks are more sensitive to temperature, and often require temperature control between the photonic IC and the laser. The size of MRR-based networks that have been demonstrated in prototype hardware has been limited to 8 modulators 122 or a 16×16 switch 7 . Until larger MRM-based networks are demonstrated in hardware, the adoption of MZM-based networks will continue to be favored.
Heterogeneous integration of low-noise SOAs in SiP is a possible way to increase the network size, but the high power consumption of SOAs degrades the energy efficiency significantly. Higher-responsivity and low-noise APDs will also prove beneficial in scaling up the network sizes. The size can be further scaled up by reducing the insertion losses of contributors such as directional couplers, phase shifters (if lossy), etc. The splitters remain a significant limitation for scaling. To make efficient use of the limited optical network sizes, general matrix multiplication (GEMM) algorithms must be adopted.
Enhancing the energy efficiency can be achieved by adopting modulators with low static and dynamic power consumption, high-responsivity APDs along with TIAs with high sensitivity, and efficient SOAs and lasers with high WPE. To better address the need for high resolution, multi-level signaling significantly beyond PAM4 and PAM8 must be implemented. Controlling the temperature of the chip also helps with maintaining high resolution. In addition, the crosstalk and distortion of SOAs should be further investigated.

ACKNOWLEDGMENTS
This work was supported by the Natural Sciences and Engineering Research Council of Canada (NSERC). Access to CAD tools and technology is facilitated by CMC Microsystems. The authors acknowledge Dr. Alex Tait at Queen's University and Avilash Mukherjee at UBC for their technical comments.

AUTHOR DECLARATIONS
The authors declare no conflict of interest.

DATA AVAILABILITY
The data that supports the findings of this study are available within the article.

APPENDIX: SCALING UP USING SOAs
This section illustrates the feasibility of incorporating SOAs in SiP networks in Mach-Zehnder and microring based implementations. Fig. 17 and Fig. 18 illustrate the number of SOAs in terms of the network resolution and scale for MZM- and MRM-based implementations, respectively. It can be observed that the resolution at which a SiP network can operate is dependent on the network scale.
It can be inferred that an MZM-based network can support signal resolutions of 4b for a network size of $N_{ltd} = 55$, with a single SOA. In comparison, incorporating SOAs in MRM implementations increases the network limited scale to $N_{ltd} = 94$ for $n_{i/p}$ = 4b, as shown in Fig. 18. The number of SOAs that can be added to a network is limited to 1 or 2.
FIG. 13. Total energy efficiency and scaling limit of an MRM-based network with an input resolution of $n_{i/p}$ = {1b, 2b, 3b, 4b}, considering thermo-optic phase shifters with and without insulation at the scaling limit, $N_{ltd}$, where the laser rated power output (10 dBm) is reached. Implementing networks with higher resolution requires scaling down the network to improve the SNR at the AFE. This degrades the energy efficiency due to the reduced number of operations.