decouple

The decouple gear breaks combinatorial paths on both the data and control signals of a DTI interface without sacrificing throughput. It adds one clock cycle of latency to the path, but throughput is unaffected regardless of the pattern in which data is written to and read from it. It is functionally transparent to the design.

decouple(din, *, depth=2) → din

The decouple() gear is essentially a FIFO with no combinatorial paths between its input and output. It is used in the following cases:

  1. To break a loop in the data path, where it prevents a combinatorial loop from forming

  2. For pipelining:

The example design features an rng() generator whose output values are fed to an incrementer. Both the value generation and the addition are performed in a single clock cycle.

rng_vals = drv(t=Uint[4], seq=[6]) | rng | flatten
rng_vals_incr = rng_vals + 1

rng_vals_incr | check(ref=[1, 2, 3, 4, 5, 6])

To shorten the combinatorial paths in the design, we can split these two operations across two clock cycles. The decouple() gear cuts the combinatorial paths on both the data and control interface signals. It does not impact the design throughput, but adds a single clock cycle of latency:

rng_vals = drv(t=Uint[4], seq=[6]) | rng | flatten
rng_vals_incr = (rng_vals | decouple) + 1

rng_vals_incr | check(ref=[1, 2, 3, 4, 5, 6])

  3. For balancing latencies on the datapath branches:

Consider a datapath consisting of two branches whose outputs are later concatenated. The first branch performs some arithmetic operations, with registers added for pipelining, and has a latency of two clock cycles. The second branch passes the data through unchanged and has zero latency. Because the pipeline depths of the two branches do not match, the resulting throughput is one data value per three clock cycles.

inp = drv(t=Uint[4], seq=[1, 2, 3])

branch1 = dreg(dreg(inp + 1) * 3)
branch2 = inp

ccat(branch1, branch2) | check(ref=[(6, 1), (9, 2), (12, 3)])

Introducing a decoupler on the second branch (the default depth of two is enough here) achieves maximum throughput after the initial latency.

inp = drv(t=Uint[4], seq=[1, 2, 3])

branch1 = dreg(dreg(inp + 1) * 3)
branch2 = inp | decouple

ccat(branch1, branch2) | check(ref=[(6, 1), (9, 2), (12, 3)])
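
The one-cycle latency and the unaffected throughput described at the top of this page can be illustrated with a small cycle-level model in plain Python. This is only a behavioral sketch under stated assumptions, not the PyGears implementation: the names DecoupleModel and simulate are hypothetical, the producer is assumed to always have data available, and the consumer is assumed to always be ready. The key property modeled is that the ready and valid outputs depend only on the stored state, so no combinatorial path connects the input side to the output side.

```python
from collections import deque


class DecoupleModel:
    """Hypothetical behavioral model: a FIFO whose outputs depend only on state."""

    def __init__(self, depth=2):
        self.depth = depth
        self.fifo = deque()

    def din_ready(self):
        # Depends only on occupancy, never on dout_ready (no control path through).
        return len(self.fifo) < self.depth

    def dout_valid(self):
        # Depends only on occupancy, never on din_valid (no data path through).
        return len(self.fifo) > 0

    def clock(self, din_valid, din_data, dout_ready):
        """Advance one clock cycle; return (item read or None, input accepted)."""
        # Both handshakes are decided from the state at the start of the cycle.
        push = din_valid and self.din_ready()
        pop = dout_ready and self.dout_valid()
        out = self.fifo.popleft() if pop else None
        if push:
            self.fifo.append(din_data)
        return out, push


def simulate(depth, items, cycles):
    """Producer always valid, consumer always ready; return (cycle, value) pairs."""
    dut = DecoupleModel(depth)
    pending = deque(items)
    received = []
    for cycle in range(cycles):
        din_valid = bool(pending)
        din_data = pending[0] if pending else None
        out, accepted = dut.clock(din_valid, din_data, dout_ready=True)
        if accepted:
            pending.popleft()
        if out is not None:
            received.append((cycle, out))
    return received
```

In this model, simulate(2, [1, 2, 3, 4], 8) returns [(1, 1), (2, 2), (3, 3), (4, 4)]: each value appears one cycle after it was written, and one value is delivered per cycle. With depth=1 the same call returns [(1, 1), (3, 2), (5, 3), (7, 4)], because a single-entry buffer with a state-only ready signal cannot accept a new value in the cycle in which it is being emptied. This suggests why a depth of two is a natural minimum for a FIFO of this kind.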