# Distributed Architecture for FPGA-based Superconducting Qubit Control

Neelay Fruitwala,<sup>1</sup> Gang Huang,<sup>1</sup> Yilun Xu,<sup>1</sup> Abhi Rajagopala,<sup>1</sup> Akel Hashim,<sup>1,2</sup> Ravi K. Naik,<sup>1,2</sup> Kasra Nowrouzi,<sup>1,2</sup> David I. Santiago,<sup>1</sup> and Irfan Siddiqi<sup>1,2</sup>

<sup>1</sup>Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
<sup>2</sup>University of California at Berkeley, Berkeley, CA 94720, USA

*Abstract*: Quantum circuits utilizing real-time feedback techniques (such as active reset and mid-circuit measurement) are a powerful tool for NISQ-era quantum computing. Such techniques are crucial for implementing error correction protocols, and can reduce the resource requirements of certain quantum algorithms. Realizing these capabilities requires flexible, low-latency classical control. We have developed a custom FPGA-based processor architecture for QubiC, an open source platform for superconducting qubit control. Our architecture is distributed in nature, and consists of a bank of lightweight cores, each configured to control a small (1-3) number of signal generator channels. Each core is capable of executing parameterized control and readout pulses, as well as performing arbitrary control flow based on mid-circuit measurement results. We have also developed a modular compiler stack and domain-specific intermediate representation for programming the processor. Our representation allows users to specify circuits using both gate and pulse-level abstractions, and includes high-level control flow constructs (e.g. if-else blocks and loops). The compiler stack is designed to integrate with quantum software tools and programming languages, such as TrueQ, pyGSTi, and OpenQASM3. In this work, we detail the design of both the processor and compiler stack, and demonstrate its capabilities with a quantum state teleportation experiment using transmon qubits at the LBNL Advanced Quantum Testbed.

## I. INTRODUCTION

Room-temperature RF control systems have become a critical part of the superconducting quantum computing stack. With qubit counts in the 10s to 100s, general-purpose RF measurement equipment, such as AWGs (arbitrary waveform generators) combined with discrete RF components, has proven to be overly costly and inefficient for qubit control and measurement. As a result, special-purpose instrumentation has emerged, in both the commercial [1], [2], [3] and academic [4], [5], [6], [7] realms. These systems integrate pulse sequencing, digital pulse generation, and readout, and are typically built around commercially available FPGAs or SoCs.

The ability to make real-time control decisions based on mid-circuit measurements is becoming an increasingly important part of quantum hardware systems, and is a key part of several proposed [8] and realized [9] quantum algorithms. For superconducting qubits, with coherence times $\sim 100~\mu s$, real-time feedback requires a controller with latencies $\sim 100~ns$. In practice, this means that the feedback control logic must be tightly integrated with the pulse sequencing layer; using an external controller or CPU would significantly increase latency.

In this work, we present an FPGA-based distributed control architecture which combines pulse sequencing with arbitrary measurement-based control flow. Our design consists of a bank of lightweight, configurable processor cores that are designed to tightly integrate with the pulse generation and signal processing gateware. We also provide a Python/JSON-based intermediate representation for writing and compiling dynamic quantum programs.

## II. OVERALL APPROACH AND SYSTEM REQUIREMENTS

The scope of this work is the pulse sequencing and parameterization layer of the FPGA gateware, *not* the digital pulse generator modules themselves.
We designed this layer to interface with the QubiC 2.0 [10] pulse generation and readout modules, though we believe that our architecture can be adapted to other qubit control systems that use digital pulse synthesis methods. We designed our system around the following principles/requirements:

1) **Pulse-centric design**: The primary control primitives are RF control/demodulation pulses; the processor holds no intrinsic information or assumptions about the quantum (unitary) operations being performed. This simplifies the processor core design and instruction set, and makes it straightforward to implement non-standard unitary operations (e.g. optimal control based approaches and certain calibration sequences [11]).
2) **Low-latency**: Superconducting qubits have coherence times $\sim 100~\mu s$. This means that for conditional operations based on mid-circuit measurements, we require an end-to-end feedback latency $\sim 100~ns$.
3) **Lightweight**: Real-time pulse generation places high demand on FPGA logic and memory resources, particularly on high-channel count devices such as the Gen3 Xilinx RFSoC [12]. The pulse-sequencing layer should therefore be as lightweight as possible to accommodate a large number of pulse generators on the same SoC/FPGA.
4) **Flexible**: Superconducting qubit systems have a wide variety of architectures and qubit modalities, each with differing control needs (e.g. readout multiplexing factor, qubit coupler control, and desired instantaneous bandwidth). Our architecture needs to accommodate this variety of pulse generator configurations, and be straightforward to configure at the gateware and software level.

Fig. 1: Block diagram of the distributed architecture. In this example, each processor core is responsible for control and readout of a single qubit. Note that the measurement and state-discrimination signal chain exists outside the core, with the results fed directly into the function processor block.

## III. ARCHITECTURE

Our architecture consists of a bank of soft processor cores that is responsible for the real-time execution of quantum programs, which involves pulse sequencing, parameterization, and triggering. Each processor core is lightweight, and is designed to interface with a small ($\sim 1-5$) number of digital pulse generators that are used for qubit control and readout. This design mirrors the parallelism inherent to quantum circuit execution, which ensures scalability: having a bank of parallel cores (and a fixed number of output channels per core) avoids the bottlenecks and latency issues that can arise in single-threaded designs as channel count grows. To enable mid-circuit feedforward operations, we also include an extensible "function processor" module for aggregating and distributing (optionally processed) measurement results to the processor cores.

### A. Processor Core

Each processor core implements a custom instruction set architecture (ISA) consisting of pulse commands for real-time control of the associated signal generators, as well as standard arithmetic and control flow instructions for on-the-fly pulse parameterization and the execution of dynamic quantum programs. The full instruction set is detailed in section IV.

*1) Signal Generator Interface:* Each core is responsible for controlling a small bank of signal generators in real time. This involves both: 1) specifying pulse parameters, such as frequency, phase, and modulation envelope; and 2) triggering the pulse at the correct time. QubiC 2.0 [10] uses DDS-based (direct digital synthesis) pulse generation modules, which can synthesize a carrier tone at the provided frequency, phase, and amplitude, and can apply a complex modulation envelope given by a time series of values.
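As a rough illustration of what a DDS-style generator of this kind computes, the following sketch modulates a complex envelope onto a carrier tone in floating point. It is not the QubiC gateware, which operates on fixed-point words and memory buffers; the function name `synthesize_pulse`, its argument names, and the numerical values are assumptions chosen for this example.

```python
import cmath
import math


def synthesize_pulse(freq_hz, phase_rad, amp, envelope, sample_rate_hz):
    """Modulate a complex envelope onto a carrier tone.

    Sample n is amp * envelope[n] * exp(i * (2*pi*freq_hz*n/sample_rate_hz + phase_rad)).
    """
    samples = []
    for n, env_value in enumerate(envelope):
        carrier_phase = 2.0 * math.pi * freq_hz * n / sample_rate_hz + phase_rad
        samples.append(amp * env_value * cmath.exp(1j * carrier_phase))
    return samples


# Illustrative values only: a 16-sample square envelope on a 100 MHz tone
# sampled at 1 GSPS, with a 90 degree phase offset and half-scale amplitude.
square_envelope = [1.0 + 0.0j] * 16
pulse = synthesize_pulse(100e6, math.pi / 2, 0.5, square_envelope, 1e9)
```

With a square envelope, every sample has magnitude `amp`, and only the carrier phase advances from sample to sample.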
In the QubiC 2.0 core $\rightarrow$ signal generator interface, the phase, amplitude, and pulse duration are provided directly via a bus, while the envelope (stored as a series of time-domain values) and frequency (stored as a series of phase offsets per unit time) are pre-allocated in dedicated memory banks, which are configured when uploading the quantum program to the FPGA. The processor core then specifies only the *address* of the envelope/frequency within these buffers. The processor core $\rightarrow$ signal generator interface consists of the following components:

- Register for storing pulse parameters. Amplitude (16-bit), phase (17-bit), and pulse duration (12-bit) are provided directly, along with pointers to the locations of the modulation envelope (12-bit address) and frequency (9-bit address) in their respective buffers. A configuration word (4-bit) is reserved for miscellaneous parameters.
- 1-bit active-high pulse trigger (c\_strobe).

The pulse register fields and trigger time are configured by pulse instructions; see section IV-A for details.

*2) Pulse Timing and Synchronization:* All pulse triggers are referenced to an internal counter, which is reset at the beginning of the program. This reset is synchronized across all cores. Additionally, QubiC 2.0 has mechanisms for clock synchronization and synchronized reset across multiple FPGA boards [?], ensuring that all pulse triggers and reference clocks are synchronized even when cores are distributed across hardware.

*3) Microarchitecture:* The processor core microarchitecture is outlined in figure 2. It is similar to a simple MIPS [13] architecture, with a general-purpose 16x 32-bit register bank, a 32-bit ALU (arithmetic logic unit), and an instruction pointer (or program counter) for interfacing with program memory. For simplicity, the ALU only implements comparison, arithmetic (addition and subtraction), and identity operations.
Instructions are implemented using a simple multi-cycle state machine with pipelined instruction fetching. The program memory, pulse interface, and function processor interface are implemented generically in SystemVerilog for portability. We chose an instruction width of 128 bits to accommodate the full 71-bit pulse register, along with the 32-bit pulse start time and other instruction metadata.

Fig. 2: Processor core microarchitecture. Includes a register file, ALU, and instruction pointer for arithmetic and control flow instructions. All pulse triggers are referenced to the time\_ref block, which is a counter that is reset at the beginning of program execution and can be incremented during runtime. All instructions are implemented as 128-bit words. Pulse fields are written to the Pulse Register block, and can be provided by values from the register file and/or instruction immediates. The pulse trigger is given by the c\_strobe signal.

### B. Function Processor

Each processor core implements a "function processor" interface for connecting to external computational resources. This interface is primarily intended for requesting/receiving (optionally processed) measurement results, although any data/computation with a compatible format can be requested. The core can request data over this interface by specifying an (implementation specific) 8-bit ID which encodes the type of data to retrieve or computation to perform. Once the data is ready, it is returned as a 32-bit word, along with a ready signal. This request/receive pipeline is triggered by a special instruction, which halts the execution of the core until the resulting data is received. At that point, it can be stored in a register, used as a pulse parameter, or used for a conditional branching decision.
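The request/receive behaviour described above can be mimicked in software with a small model. This is only a sketch: the class `FunctionProcessorModel`, the helper `next_instruction_pointer`, and the IDs used are hypothetical names introduced for illustration, and the hardware handshake (the core stalling until the ready signal) is collapsed into an ordinary blocking function call.

```python
class FunctionProcessorModel:
    """Toy software model of the request/receive interface (not the gateware)."""

    ID_BITS = 8     # width of the request ID, as described in the text
    DATA_BITS = 32  # width of the returned data word

    def __init__(self):
        self._sources = {}

    def register_source(self, func_id, source):
        """Attach a zero-argument callable that produces the data for func_id."""
        if not 0 <= func_id < (1 << self.ID_BITS):
            raise ValueError("func_id must fit in 8 bits")
        self._sources[func_id] = source

    def request(self, func_id):
        """Return the 32-bit result word for func_id (blocks until available)."""
        return self._sources[func_id]() & ((1 << self.DATA_BITS) - 1)


def next_instruction_pointer(fproc, func_id, compare_value, pc, jump_target):
    """Jump if the returned word equals compare_value, else fall through."""
    result = fproc.request(func_id)
    return jump_target if result == compare_value else pc + 1


fproc = FunctionProcessorModel()
# Hypothetical ID 1: pretend the most recent measurement result was 1.
fproc.register_source(1, lambda: 1)
next_pc = next_instruction_pointer(fproc, func_id=1, compare_value=1,
                                   pc=4, jump_target=7)
```

Here the returned word drives a conditional branch; storing it in a register or using it as a pulse parameter would follow the same request pattern.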
In the current implementation on QubiC 2.0, the function processor interface simply accesses a memory bank containing the most recent state-discriminated measurement result from each of the eight qubits driven by the respective FPGA board. This allows any core to request a result from any qubit (provided that it is driven by the same FPGA). Future implementations may extend the function processor to include results from different (synchronized) boards or application-specific measurement decoders.

## IV. INSTRUCTION SET ARCHITECTURE

Each processor core implements an instruction set consisting of 1) pulse instructions, for parameterizing and triggering pulses; 2) standard register arithmetic and control flow instructions; and 3) special-purpose instructions for timing control and interaction with the FPROC interface. In the following section we provide a general overview of the different instruction types; an exhaustive reference can be found in [14].

### A. Pulse Instructions

There are two types of pulse instructions: pulse\_write, which writes to the specified fields of the pulse register, and pulse\_write\_trig, which has all of the functionality of pulse\_write, but also triggers the pulse at the specified trigger time. The general format for both of these instructions can be found in figure 4.

*1) pulse\_write:* Pulse register fields can be written to by either an immediate value or a native processor core register (with the exception of the 4-bit configuration word, which must be an immediate). However, only one processor register can be accessed during any given write. The pulse\_write instruction therefore has two additional bits per field: i) write\_enable, which controls whether to write to that field, and ii) register/immediate select, which controls (if write\_enable is high) whether the input value comes from the selected register or from the instruction immediate.
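To make the per-field control bits concrete, the sketch below packs a 128-bit pulse instruction word using the bit ranges listed in figure 4. Several details are assumptions made for illustration and are not taken from the QubiC sources: the opcode value, the ordering of the two control bits (write enable as the upper bit, register select as the lower bit), and the helper names `pack` and `unpack`.

```python
# Bit ranges (msb, lsb) as listed in figure 4; bits 4:0 are zero padding.
FIELDS = {
    "opcode": (127, 120),
    "reg_addr": (119, 116),
    "env_ctrl": (115, 114),
    "env_word": (113, 90),
    "phase_ctrl": (89, 88),
    "phase_word": (87, 71),
    "freq_ctrl": (70, 69),
    "freq_word": (68, 60),
    "amp_ctrl": (59, 58),
    "amp_word": (57, 42),
    "cfg_en": (41, 41),
    "cfg_word": (40, 37),
    "start_time": (36, 5),
}

WRITE_EN = 0b10  # assumed position of the write-enable bit
REG_SEL = 0b01   # assumed position of the register/immediate select bit
HYPOTHETICAL_PULSE_WRITE_OPCODE = 0x10  # placeholder, not the real opcode


def pack(**values):
    """Pack named field values into a single 128-bit instruction word."""
    word = 0
    for name, value in values.items():
        msb, lsb = FIELDS[name]
        width = msb - lsb + 1
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name} does not fit in {width} bits")
        word |= value << lsb
    return word


def unpack(word, name):
    """Extract one named field from an instruction word."""
    msb, lsb = FIELDS[name]
    return (word >> lsb) & ((1 << (msb - lsb + 1)) - 1)


# Write the phase from register 3 and the amplitude from an immediate;
# the envelope and frequency fields are left untouched (write enable low).
word = pack(
    opcode=HYPOTHETICAL_PULSE_WRITE_OPCODE,
    reg_addr=3,
    phase_ctrl=WRITE_EN | REG_SEL,
    amp_ctrl=WRITE_EN,
    amp_word=0x7FFF,
)
```

Because the instruction carries a single `reg_addr` field, at most one register can feed the pulse fields in a given write, which matches the restriction described above.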
*2) pulse\_write\_trig:* The pulse\_write\_trig instruction adds a 32-bit start\_time field, which specifies when the pulse trigger is activated. The value is given in units of FPGA clock cycles since program start and is referenced to an internal counter (figure 2). Processor core execution is halted until the pulse is triggered.

### B. Timing Control

There are certain situations (for example, when looping over a pulse sequence or waiting for a measurement) where the timing-related behavior of a program must be altered. We provide two instructions for this: the inc\_qclk instruction, which increments the time reference by a signed immediate or register value, and the idle instruction, which halts execution of the core until the provided timestamp.

### C. Arithmetic and Logical Operations

Register-based arithmetic and boolean operations are performed using reg\_alu instructions. Supported operations include boolean comparisons ($<$, $>$, $=$), identities, addition, and subtraction, all on 32-bit signed values. Both register and immediate operands are supported. Results are always stored in a register.

### D. Control Flow

Any ALU-based boolean comparison can be used to control a jump instruction, which sets the instruction pointer to an arbitrary location in the program memory. Destination addresses must be instruction immediates. Unconditional jumps are also supported.

### E. Function Processor

Function processor instructions are used to request/receive data over the FPROC interface. These instructions extend ordinary ALU and control flow instructions, but replace one of the fields with the FPROC result. For example, the <code>jump\_fproc</code> instruction replaces the RHS input of the jump condition with the FPROC result.

## V. ASSEMBLY LANGUAGE

We provide a human-readable assembly language that is approximately a one-to-one mapping to the processor core instruction set.
The language is formatted as a list of JSON [15] strings, with the assembler and associated infrastructure written in the Python programming language. The assembly language instruction fields match those of the instruction set with the following exceptions:

- **Pulse parameters**: all pulse parameters (frequency, amplitude, phase) are provided as floating point values. Frequency and phase are given in SI units, while amplitude is normalized to the DAC full scale. Envelopes are provided as parameterized functions or complex NumPy arrays. Pulse output channels are named, and resolved to gateware/hardware indices during assembly.
- **Register names**: for readability, register names are provided as strings, and are resolved into indices during assembly.
- **Register types**: for straightforward pulse parameterization, registers are typed as amp, phase, or int. All operations on amp and phase type registers are provided in their respective units (float in range [0, 1] for amp and radians for phase), and are converted to the corresponding pulse-field word during assembly. No conversion is performed with int type registers.
- **FPROC ID**: function IDs can optionally be specified according to named output channel attributes in the provided channel configuration file. For example, in the program in figure 5, the function ID is provided by the core\_id parameter of the Q1.rdlo channel.

The assembler takes as input a separate list of instructions for each core, and generates the following outputs: 1) per-core program binaries; 2) the corresponding set of envelope and frequency buffers. These binaries are stored in a Python dictionary, from which they can be loaded by the low-level QubiC driver software into the FPGA BRAM (block RAM). The assembler is configured using the following:

- **ElementConfig implementation**: ElementConfig is a generic Python class that is implemented separately for each type of firmware signal generator block.
It is responsible for converting the provided pulse phases and amplitudes into the correctly formatted words, and computing the frequency and envelope buffers.
- **Channel configuration file**: this file maps named output channels to firmware channel indices. It may also optionally parameterize the implemented ElementConfig class.

## VI. COMPILER TOOLS AND INTERMEDIATE REPRESENTATION

In order to provide users with a high-level format for writing QubiC programs, and to interface with higher-level tools such as TrueQ [16], OpenQASM [17], and pyGSTi [18], we provide a custom intermediate representation (QubiC-IR), along with a set of compiler tools for lowering QubiC-IR to distributed processor assembly. We designed QubiC-IR to have the following general attributes:

- **Multi-level**: In order to provide users with a variety of interfacing options (e.g. native-gate level vs pulse level), QubiC-IR operates at multiple abstraction layers. Only a subset of instructions is directly compilable into distributed processor assembly.
- Program flow is single-threaded; the scheduling and compilation tools will parallelize control operations and determine which core(s) need to be targeted by each instruction.
- As with the assembly language, QubiC-IR is primarily represented as JSON; IR lowering and compilation is performed using a Python API.

The bulk of the compilation is performed in a series of passes that transform the IR. Once the IR has been sufficiently lowered, a final pass converts it to distributed processor assembly. The compiler flow is customizable; users can both configure individual passes and specify the set of passes to run.

In the following sections, we give an overview of IR instruction types and associated compiler flows. A full reference can be found at [19].

### A. Control Operations: Gates and Pulses

QubiC-IR supports a Pulse instruction that is largely identical to that of the assembly language.
We also support a Gate instruction that allows the program to be written at the native quantum gate level, which can then be resolved into pulses by specifying a calibration file containing the pulse parameters associated with each gate.

### B. Classical Variables and Arithmetic

QubiC-IR supports the declaration and manipulation of variables to perform classical computations. Variables are a generalization of assembly language registers; supported operations and allowed datatypes (int, phase, amp) are the same. However, unlike registers, a variable can be scoped to multiple processor cores, indicating that the variable declaration itself and any manipulations should be duplicated across the relevant cores as register operations. The scope of any variable is specified by the list of hardware output channels it influences (through either control flow operations or direct pulse parameterization).

| 127:124 | 123 | 122:120 | 119:88 | 87:84 | 83:68 | 67:52 | 51:0 |
|---------|-------|---------|-------------------|-------------|-----------------------|----------|-------|
| opcode | (r/i) | ALU op | ALU input 0 (r/i) | ALU input 1 | dest reg or jump addr | FPROC ID | 52'b0 |

Fig. 3: General format for arithmetic and control flow instructions. The instruction type is given by the opcode. Bit 123 (r/i) is used to specify whether ALU input 0 is an instruction immediate or register value from the provided address. The inc\_qclk instruction also follows this format, with only the opcode fields (127:120) and ALU input 0 provided.

| 127:120 | 119:116 | 115:114 | 113:90 | 89:88 | 87:71 | 70:69 | 68:60 | 59:58 | 57:42 | 41 | 40:37 | 36:5 | 4:0 |
|---------|----------|----------|----------|------------|------------|-----------|-----------|----------|----------|--------|----------|------------|-----|
| opcode | reg addr | env ctrl | env word | phase ctrl | phase word | freq ctrl | freq word | amp ctrl | amp word | cfg en | cfg word | start time | 0 |

Fig. 4: General format for pulse\_write and pulse\_write\_trig instructions. Each pulse field (env, phase, amp, and freq) has two control bits: one for write enable and another to select register (from address in 119:116) or instruction immediate. In our implementation, the phase and amplitude are specified directly as scaled values, the frequency is provided as an address, and the envelope word specifies both the start address and envelope length. A 4-bit config word is provided for miscellaneous configuration parameters; this must be provided as an instruction immediate. The idle instruction also follows this format, but the only provided fields are the instruction opcode and the start time, which provides the timestamp after which to resume core execution.

### C. Virtual-Z Instructions and Phase Tracking

In general, virtual-Z gates are implemented by applying a phase offset to any subsequent control pulses at the specified qubit frequency. QubiC-IR supports a VirtualZ instruction for this purpose, with two arguments: 1) qubit frequency, and 2) rotation angle (in radians). The qubit frequency can be named (having been previously declared in the program, or defined in the gate calibration file), or anonymous (specified directly using its numerical value).

By default, VirtualZ instructions are resolved in software; the provided phases are applied directly to the relevant control pulses during compilation. However, hardware (i.e. on-FPGA) resolution is also supported; a BindPhase directive can be used to *bind* the phase of all control pulses at a particular frequency to a declared variable. For example, binding the frequency Q0.freq to the variable q0\_phase results in all control pulses with frequency Q0.freq having their phase parameter given by q0\_phase. Hardware phase parameterization is required for certain dynamic circuit operations, such as conditional/repeated application of Z-gates.

### D.
Control Flow

QubiC-IR supports both high-level and low-level (assembly-like) control flow. At the high level, there are two instructions: BranchVar and Loop. The BranchVar instruction functions as a conditional execution (if/else) statement; the instruction contains an ALU conditional operation to evaluate, and true and false code blocks which conditionally execute depending on the result of the conditional. The true and false blocks can contain any valid IR code, including nested control flow. The Loop instruction consists of an ALU condition, along with body code that executes repeatedly while the condition evaluates to True.

High-level control flow instructions are resolved into lower-level Jump and JumpCond instructions. These are identical to their assembly-level counterparts, except that, like ALU arithmetic instructions, they can be scoped (hence duplicated) across multiple processor cores. After all control flow is lowered to assembly-like control flow, another pass divides the program into basic blocks and builds the associated control-flow graph (CFG). The CFG is then used by subsequent passes, such as virtual-Z phase resolution and scheduling, to track changes in program state across the full program flow.

### E. Function Processor

Special instructions are used to request/receive data over the FPROC interface. As with the assembly language, these instructions extend the normal arithmetic and control flow instructions, replacing the RHS ALU input with the function processor data. The QubiC-IR infrastructure can resolve channel names into an assembly-compatible format, and add the appropriate delays to ensure that the processor core has enough time to receive and process the measurement results. This is done using a compiler pass together with an FPROCChannelConfig object, which contains a mapping of named FPROC channels to an associated measurement delay and channel ID.

### F.
Scheduling

QubiC-IR provides two instructions for specifying timing relationships between gates/pulses: Delay and Barrier. The Delay instruction delays all subsequent pulses on the specified channels by the provided amount. The Barrier instruction is similar to an OpenQASM [17] barrier; it aligns the start times of the following pulses to be played on the indicated channels.

The QubiC compiler has a scheduling pass, which assigns trigger timestamps to all Pulse instructions. Timestamps are determined by taking into account timing constraints (i.e. delays and barriers), pulse length, and instruction execution time. Running the scheduler is optional; users are free to directly provide each pulse with a timestamp. In this case, a linter pass is provided to ensure that the pulse schedule satisfies the execution constraints of the processor core(s) (for example, a pulse cannot be triggered during an ALU or branching operation).

```
{('Q1.qdrv', 'Q1.rdrv', 'Q1.rdlo'): [
    {'op': 'phase_reset'},
    # readout drive pulse
    {'op': 'pulse', 'freq': 6.5578e9, 'phase': 0.0, 'amp': 0.041,
     'env': {'env_func': 'cos_edge_square',
             'paradict': {'ramp_fraction': 0.1, 'twidth': 1.6e-06}},
     'start_time': 5, 'dest': 'Q1.rdrv'},
    # readout demodulation pulse
    {'op': 'pulse', 'freq': 6.5578e9, 'phase': 0.0, 'amp': 1.0,
     'env': {'env_func': 'square',
             'paradict': {'phase': 0.0, 'amplitude': 1.0,
                          'twidth': 1.59e-06}},
     'start_time': 325, 'dest': 'Q1.rdlo'},
    # idle to wait for measurement
    {'op': 'idle', 'end_time': 1184},
    # jump instruction; jump to 'true_1' if measured state is 1
    {'op': 'jump_fproc', 'in0': 1, 'alu_op': 'eq',
     'jump_label': 'true_1', 'func_id': ('Q1.rdlo', 'core_ind')},
    # if state is 0, jump to end
    {'op': 'jump_label', 'dest_label': 'false_1'},
    {'op': 'jump_i', 'jump_label': 'end_1'},
    # if state is 1, play pulse
    {'op': 'jump_label', 'dest_label': 'true_1'},
    {'op': 'pulse', 'freq': 4.67035e9, 'phase': 0, 'amp': 0.5,
     'env': {'env_func': 'DRAG',
             'paradict': {'alpha': 0, 'sigmas': 3, 'delta': -260.157e3,
                          'twidth': 3e-08}},
     'start_time': 1195, 'dest': 'Q1.qdrv'},
    # program end
    {'op': 'jump_label', 'dest_label': 'end_1'},
    {'op': 'done_stb'}],
}
```

Fig. 5: Example assembly code for single-qubit reset. This program initiates a readout on Q1, then conditionally plays a drive pulse depending on the measurement outcome. The assembly program is formatted as a Python/JSON dictionary, with the program for each processor core keyed by a tuple of channels controlled by that core. In this example, we are only using the qubit Q1, which is controlled by the ('Q1.qdrv', 'Q1.rdrv', 'Q1.rdlo') core.

```
import numpy as np

[
    # Since we have conditional z-gates, we need to parameterize the
    # phase of all Q0 drive pulses with a variable
    {'name': 'bind_phase', 'var': 'q0_phase', 'qubit': 'Q0'},
    # Wait 500 microseconds for qubits to decay
    {'name': 'delay', 't': 500.e-6},
    {'name': 'read', 'qubit': ['Q1']},
    # Measurement-based conditional branching operation. Condition being
    # evaluated is: 'Q1.meas == 1'. Function ID and associated measurement
    # delays are resolved by the compiler.
    {'name': 'branch_fproc', 'cond_lhs': 1, 'alu_cond': 'eq',
     'func_id': 'Q1.meas',
     'true': [
         {'name': 'X90', 'qubit': ['Q0']},
         {'name': 'X90', 'qubit': ['Q0']}],
     'false': [
         {'name': 'virtual_z', 'phase': np.pi, 'qubit': 'Q0'}],
     'scope': ['Q0']},
    # scheduling barrier, then final readout
    {'name': 'barrier', 'qubit': ['Q0', 'Q1']},
    {'name': 'read', 'qubit': ['Q0']},
    {'name': 'read', 'qubit': ['Q1']},
]
```

Fig. 6: Example program with measurement-based control flow. Q1 is measured, and depending on the outcome of the measurement, either a phase flip or a bit flip is applied to Q0.

## VII. FPGA IMPLEMENTATION

The processor cores, function processor, and associated interfaces are implemented in Verilog and SystemVerilog.
The current implementation is integrated with QubiC 2.0 on the ZCU216 RFSoC platform. Our design is modular; SystemVerilog interfaces are used to connect the processor cores to the QubiC signal generator, memory banks, and measurement modules. A variety of QubiC 2.0 implementations exist, with between 4 and 16 DAC drive channels and a variety of signal generator and readout multiplexing configurations [10]. In the following sections, we describe a distributed architecture + QubiC 2.0 implementation designed for the AQT Trailblazer QPU (quantum processing unit), which is an 8-qubit superconducting transmon system, with fixed qubit frequency, fixed coupling, and 8x multiplexed readout [20]. This implementation has eight distributed processor cores, i.e. one for each qubit.

Fig. 7: Signal path for a single processor core/qubit for the current QubiC 2.0 implementation. The qubit drive channel goes to a dedicated DAC, while the readout drive (demodulation) channels are connected to a common multiplexed readout DAC (ADC). The state-discriminated measurement result gets sent to the function processor module.

| | LUT | FF | DSP | BRAM |
|----------------|---------------|---------------|-----|------------|
| processor core | 387 (0.091 %) | 401 (0.047 %) | 0 | 2 (0.19 %) |
| FPROC | 24 (0.006 %) | 56 (0.007 %) | 0 | 0 |

Fig. 8: Resource utilization table for a single processor core and full function processor module. Utilization is given as both the absolute number of blocks used and fraction of total utilization for each resource type. The BRAM (block RAM) used by the processor core corresponds to the program memory. Reported values are for the Xilinx ZU49DR FPGA, and were generated using Xilinx Vivado.
Each core controls three signal generator channels (figure 7): one for qubit drive (which goes to a dedicated DAC), another for readout drive (which goes to a common multiplexed readout DAC), and another for readout demodulation (which mixes with the readout resonator response tone from the multiplexed readout ADC). All processor cores, signal generator blocks, and readout demodulation are on the same 500 MHz clock domain. All drive and readout DACs are configured to operate at 8 GSPS (gigasamples per second), and the ADCs at 2 GSPS.

### A. Resource Utilization

The FPGA resource requirements for our implementation are reported in figure 8 and can be visualized on the floorplan (figure 9). The logic resource requirements (CLB and DSP) of each processor core are minimal in comparison to the resources required by the corresponding signal processing blocks (i.e. control/readout signal generation and readout demodulation for a single qubit), so our architecture is unlikely to present a significant scaling bottleneck.

Fig. 9: FPGA floorplan of the 8-qubit QubiC 2.0 implementation. Teal-colored cells mark all area utilized by the design (i.e. logic/DSP/memory cells). The highlighted yellow cells mark logic regions used by a single distributed processor core, while green cells mark regions used by the corresponding single-qubit drive and demodulation signal chain. The orange cells highlight block RAM (BRAM) used for processor core program memory. The pink cells mark regions used by the function processor. This floorplan is for the Xilinx ZU49DR FPGA, and was generated using Xilinx Vivado.

Fig. 10: Quantum teleportation circuit. This circuit teleports an arbitrary state encoded on qubit Q0 (prepared by the arbitrary single-qubit rotation U) to qubit Q2. Two mid-circuit measurement-based conditional gates are utilized (the final X and Z gates on Q2).

The BRAM (block RAM) utilization is largely arbitrary, and depends on the desired circuit depth/program size.
In our implementation, single-core BRAM utilization is minimal at 0.2 %, corresponding to a program memory capable of storing 2048 128-bit instruction words.

# VIII. EXPERIMENTAL DEMONSTRATION: QUANTUM TELEPORTATION

In order to demonstrate the mid-circuit measurement and feedforward capabilities of our architecture, we performed a quantum state teleportation experiment [21]. For this experiment, we used the AQT (Advanced Quantum Testbed) Trailblazer QPU, which has 8 fixed-frequency transmon qubits with linear connectivity [20]. Our teleportation circuit is given in figure 10; we used the BQSKit compiler [22] to translate this circuit into the AQT Trailblazer's native gate set (consisting of $X_{90}$, CZ, and virtual-$Z(\theta)$ gates).

We performed the teleportation experiment for four different initial states on qubit Q0: $|0\rangle$, $|+\rangle$, $|-\rangle$, and $|1\rangle$. The corresponding Z-basis measurement results for qubit Q2 are shown in figure 11 for the $|0\rangle$ and $|1\rangle$ initial Q0 states. For the $|+\rangle$ and $|-\rangle$ states, we also performed measurements in the X and Y bases to determine the position of the state vector in the X-Y plane of the Bloch sphere (figure 12). A table of measured expectation values for the destination qubit Q2 is given in figure 13. For the $|0\rangle$, $|+\rangle$, and $|1\rangle$ states, the measured expectation values show good agreement with theoretical predictions, indicating successful teleportation of the quantum state from Q0 to Q2. For the $|-\rangle$ state, the measured expectation values show partial agreement, with a significant deviation in the Y basis (0.166 measured, versus an ideal value of 0). This discrepancy is likely due to a combination of dephasing and readout-induced crosstalk, which are known issues when implementing mid-circuit feedforward operations [23], [24], [25].

## IX. CONCLUSION

We have developed an open source FPGA-based architecture for superconducting qubit control and measurement.
Our architecture supports the execution of dynamic circuits, including mid-circuit measurement and feedforward, and real-time parameter updates. We also provide a modular compiler stack and intermediate representation that supports a variety of abstraction levels and can integrate with standard quantum programming tools. Our architecture is deployed on the QubiC 2.0 [10] system, which currently uses the Xilinx ZCU216 RFSoC evaluation board, and has been used to control the 8-qubit Trailblazer QPU at the LBNL AQT. In addition to the state teleportation demonstration presented in this paper, our system has enabled the demonstration of novel scientific results, including randomized compiling for mid-circuit measurement [26] and measurement-based entanglement generation [23]. Our design and compiler stack are fully open source, and can be found on GitLab: https://gitlab.com/LBL-QubiC/distributed\_processor.

Fig. 11: State teleportation measurement results showing Z-basis measurements of the destination qubit (Q2 from figure 10) for initial states $|0\rangle$ and $|1\rangle$ prepared on Q0. 10,000 shots were collected for each measurement.

#### ACKNOWLEDGMENT

This work was supported by the U.S. Department of Energy, Office of Science, Advanced Scientific Computing Research Testbeds for Science program, the National Quantum Information Science Research Centers Quantum Systems Accelerator, and the High Energy Physics QUANTISED program under Contract No. DE-AC02-05CH11231.

### REFERENCES

- [1] D. De Jong, B. Kalyoncu, and C. C. Bultink, "Qblox, demonstrating fully-integrated and modular quantum control from the bottom up," in *APS March Meeting Abstracts*, vol. 2022, 2022, pp. M78–001.
- [2] A. Mandelis, "Focus on test, measurement, quantum metrology, and analytical equipment," 2022.
- [3] L. Ella, "How to build a scalable quantum controller," *Computer*, vol. 55, no. 3, pp. 91–94, 2022.
- [4] K. H. Park, Y. S. Yap, Y. P. Tan, C. Hufnagel, L. H. Nguyen, K. H. Lau, P. Bore, S.
Efthymiou, S. Carrazza, R. P. Budoyo *et al.*, "Icarus-q: Integrated control and readout unit for scalable quantum processors," *Review of Scientific Instruments*, vol. 93, no. 10, 2022.

Fig. 12: State teleportation measurement results showing X, Y, and Z basis measurements of the destination qubit (Q2 from figure 10) for initial states $|+\rangle$ and $|-\rangle$ prepared on Q0. 10,000 shots were collected for each measurement.

| initial state      | basis | expectation value |
|--------------------|-------|-------------------|
| $\lvert 0\rangle$  | Z     | 0.822 (0.008)     |
| $\lvert 1\rangle$  | Z     | -0.845 (0.008)    |
| $\lvert +\rangle$  | X     | 0.884 (0.007)     |
| $\lvert +\rangle$  | Y     | 0.017 (0.015)     |
| $\lvert +\rangle$  | Z     | 0.010 (0.015)     |
| $\lvert -\rangle$  | X     | -0.818 (0.008)    |
| $\lvert -\rangle$  | Y     | 0.166 (0.015)     |
| $\lvert -\rangle$  | Z     | 0.034 (0.015)     |

Fig. 13: State teleportation experimental results. Table of expectation values for the destination qubit Q2 for different Q0 initial preparation states and measurement bases.

- [5] L. Stefanazzi, K. Treptow, N. Wilcer, C. Stoughton, C. Bradford, S. Uemura, S. Zorzetti, S. Montella, G. Cancelo, S. Sussman *et al.*, "The qick (quantum instrumentation control kit): Readout and control for qubits and detectors," *Review of Scientific Instruments*, vol. 93, no. 4, 2022.
- [6] M. O. Tholén, R. Borgani, G. R. Di Carlo, A. Bengtsson, C. Križan, M. Kudra, G. Tancredi, J. Bylander, P. Delsing, S. Gasparinetti *et al.*, "Measurement and control of a superconducting quantum processor with a fully integrated radio-frequency system on a chip," *Review of Scientific Instruments*, vol. 93, no. 10, 2022.
- [7] Y. Xu, G. Huang, J. Balewski, R. Naik, A. Morvan, B. Mitchell, K. Nowrouzi, D. I. Santiago, and I. Siddiqi, "QubiC: An open-source FPGA-based control and measurement system for superconducting quantum information processors," *IEEE Transactions on Quantum Engineering*, vol. 2, pp. 1–11, 2021.
- [8] P. Deliyannis, J. Sud, D. Chamaki, Z. Webb-Mack, C. W. Bauer, and B.
Nachman, "Improving quantum simulation efficiency of final state radiation with dynamic quantum circuits," *Physical Review D*, vol. 106, no. 3, p. 036007, 2022.
- [9] A. D. Córcoles, M. Takita, K. Inoue, S. Lekuch, Z. K. Minev, J. M. Chow, and J. M. Gambetta, "Exploiting dynamic quantum circuits in a quantum algorithm with superconducting qubits," *Physical Review Letters*, vol. 127, no. 10, p. 100501, 2021.
- [10] Y. Xu, G. Huang, N. Fruitwala, A. Rajagopala, R. K. Naik, K. Nowrouzi, D. I. Santiago, and I. Siddiqi, "Qubic 2.0: An extensible open-source qubit control system capable of mid-circuit measurement and feedforward," arXiv preprint arXiv:2309.10333, 2023.
- [11] J. Werschnik and E. Gross, "Quantum optimal control theory," *Journal of Physics B: Atomic, Molecular and Optical Physics*, vol. 40, no. 18, p. R175, 2007.
- [12] B. Farley, J. McGrath, and C. Erdmann, "An all-programmable 16-nm rfsoc for digital-rf communications," *IEEE Micro*, vol. 38, no. 2, pp. 61–71, 2018.
- [13] G. Kane and J. Heinrich, *MIPS RISC architectures*. Prentice-Hall, Inc., 1992.
- [14] N. Fruitwala. (2023) Distributed processor instruction set. [Online]. Available: https://gitlab.com/LBL-QubiC/distributed\_processor/-/wikis/Instruction-Set
- [15] D. Crockford and C. Morningstar, "Ecma-404 the json data interchange syntax," Geneva: ECMA International, 2017.
- [16] True-q. [Online]. Available: https://trueq.quantumbenchmark.com/
- [17] A. Cross, A. Javadi-Abhari, T. Alexander, N. De Beaudrap, L. S. Bishop, S. Heidel, C. A. Ryan, P. Sivarajah, J. Smolin, J. M. Gambetta *et al.*, "Openqasm 3: A broader and deeper quantum assembly language," *ACM Transactions on Quantum Computing*, vol. 3, no. 3, pp. 1–50, 2022.
- [18] E. Nielsen, K. Rudinger, T. Proctor, A. Russo, K. Young, and R. Blume-Kohout, "Probing quantum processor performance with pygsti," *Quantum Science and Technology*, vol. 5, no. 4, p. 044002, 2020.
- [19] N. Fruitwala. (2024) Qubic-ir language reference. [Online].
Available: https://lbl-qubic.gitlab.io/distributed\_processor/
- [20] J. Kreikebaum, K. O'Brien, A. Morvan, and I. Siddiqi, "Improving wafer-scale josephson junction resistance variation in superconducting quantum coherent circuits," *Superconductor Science and Technology*, vol. 33, no. 6, p. 06LT02, 2020.
- [21] C. H. Bennett, G. Brassard, C. Crépeau, R. Jozsa, A. Peres, and W. K. Wootters, "Teleporting an unknown quantum state via dual classical and einstein-podolsky-rosen channels," *Physical Review Letters*, vol. 70, no. 13, p. 1895, 1993.
- [22] E. Younis, C. C. Iancu, W. Lavrijsen, M. Davis, and E. Smith, "Berkeley quantum synthesis toolkit (bqskit) v1," Lawrence Berkeley National Laboratory (LBNL), Berkeley, CA (United States), Tech. Rep., 2021.
- [23] A. Hashim, M. Yuan, P. Gokhale, L. Chen, C. Juenger, N. Fruitwala, Y. Xu, G. Huang, L. Jiang, and I. Siddiqi, "Efficient generation of multipartite entanglement between non-local superconducting qubits using classical feedback," 2024.
- [24] J. Gambetta, A. Blais, D. I. Schuster, A. Wallraff, L. Frunzio, J. Majer, M. H. Devoret, S. M. Girvin, and R. J. Schoelkopf, "Qubit-photon interactions in a cavity: Measurement-induced dephasing and number splitting," *Physical Review A*, vol. 74, no. 4, p. 042318, 2006.
- [25] B. K. Mitchell, R. K. Naik, A. Morvan, A. Hashim, J. M. Kreikebaum, B. Marinelli, W. Lavrijsen, K. Nowrouzi, D. I. Santiago, and I. Siddiqi, "Hardware-efficient microwave-activated tunable coupling between superconducting qubits," *Physical Review Letters*, vol. 127, no. 20, p. 200502, 2021.
- [26] A. Hashim, A. Carignan-Dugas, L. Chen, C. Juenger, N. Fruitwala, Y. Xu, G. Huang, J. J. Wallman, and I. Siddiqi, "Quasi-probabilistic readout correction of mid-circuit measurements for adaptive feedback via measurement randomized compiling," 2024.