slow_fir_filter

This document contains technical documentation for the slow_fir_filter module.

Version

This file is part of slow_fir_filter release 0.0.0 collected at 2022-01-07 12:03.

Releases from Truestream follow the semantic versioning scheme: MAJOR.MINOR.PATCH+HASH.

  • MAJOR will be incremented for incompatible API or functionality changes.

  • MINOR will be incremented when new functionality is added in a backwards compatible manner.

  • PATCH will be incremented for backwards compatible bug fixes.

The HASH field is the git sha that the release was made from. It is included in the version number for internal traceability.
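As an illustration of the version format described above, the following is a minimal Python sketch that splits a version string into its fields. The helper name parse_release_version and the ReleaseVersion type are hypothetical and not part of any Truestream tooling.

```python
import re
from typing import NamedTuple


class ReleaseVersion(NamedTuple):
    major: int
    minor: int
    patch: int
    git_hash: str  # Empty string when no "+HASH" suffix is present.


# Hypothetical pattern for MAJOR.MINOR.PATCH with an optional +HASH suffix.
_VERSION_RE = re.compile(r"^(\d+)\.(\d+)\.(\d+)(?:\+([0-9a-f]+))?$")


def parse_release_version(text: str) -> ReleaseVersion:
    """Split a version string such as '3.0.0+fd52a59f' into its fields."""
    match = _VERSION_RE.match(text)
    if match is None:
        raise ValueError(f"Not a MAJOR.MINOR.PATCH+HASH version: {text!r}")
    major, minor, patch, git_hash = match.groups()
    return ReleaseVersion(int(major), int(minor), int(patch), git_hash or "")


version = parse_release_version("3.0.0+fd52a59f")
# A change in MAJOR signals an incompatible API or functionality change.
is_breaking = version.major > parse_release_version("2.0.0+3c9f30b2").major
```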

Release notes

Changelog and release history for the slow_fir_filter module. Changelogs from Truestream follow the keep a changelog format.

Unreleased

Update documentation with note on handshake rules.

3.0.0+fd52a59f - (17 September 2020)

Added

  • Make coefficient set selectable per channel.

2.0.0+3c9f30b2 - (16 September 2020)

Added

  • Add support for multiple coefficient sets.

1.0.1+ba513de5 - (28 August 2020)

Changes

  • Documentation fixes

Requirements

This module has the following dependencies:

  • The open-source hdl_modules project version 1.0.0.

  • The Truestream module fir_filter_common version 0.0.1.

Library name

This module’s source files shall be compiled to a VHDL library symbolically named slow_fir_filter.

Overview

This module provides a one-DSP FIR filter implementation. It is suitable for filtering multiple input channels where the total data rate is significantly slower than the system clock. A key feature is that the channels can be independent of each other, without any known timing relationship between the data. The module also supports multiple coefficient sets, which can be switched in real time.

Downsampling and upsampling are supported. Downsampling is performed after the filter, and upsampling before the filter. Downsampling and upsampling cannot be used simultaneously.

The top level to be instantiated by the user is slow_fir_filter, which has these generics:

  • num_channels: The number of channels.

  • input_data_width: Input data width.

  • num_coefficient_sets: The number of coefficient sets that will be used.

  • num_coefficients_per_set: The number of coefficients (taps) in each coefficient set. Note that this is the filter order plus one.

  • coefficients: An integer vector with all coefficients.

  • upsampling: The upsampling factor. Default is 1 (no upsampling).

  • downsampling: The downsampling factor. Default is 1 (no downsampling).

  • result_width: Width of the result data. If this differs from the width of the accumulated data, the data is either padded or truncated to match. See Data sizing for more information.
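The constraints on the generics can be summarized in a small consistency check. The sketch below is an assumption-laden illustration in Python, not part of the module: the function check_generics is hypothetical, and the expected length of the coefficients vector (number of sets times coefficients per set) is an assumption that the text above does not state explicitly.

```python
def check_generics(num_coefficient_sets, num_coefficients_per_set, coefficients,
                   upsampling=1, downsampling=1):
    """Return a list of problems; an empty list means the values look consistent."""
    problems = []
    if upsampling < 1 or downsampling < 1:
        problems.append("upsampling and downsampling must be positive")
    # Stated in the overview: the two cannot be used simultaneously.
    if upsampling > 1 and downsampling > 1:
        problems.append("upsampling and downsampling cannot be used simultaneously")
    # Assumption: the flat vector holds every coefficient of every set.
    expected_length = num_coefficient_sets * num_coefficients_per_set
    if len(coefficients) != expected_length:
        problems.append(
            f"expected {expected_length} coefficients, got {len(coefficients)}"
        )
    return problems


ok = check_generics(2, 3, [1, 2, 3, 4, 5, 6])
both_resampling = check_generics(1, 3, [1, 2, 3], upsampling=2, downsampling=2)
```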

Interface

There is an input port, with ready and valid handshake signals, for each channel. The result data port is shared by all channels, but each channel has its own set of handshake signals for the result.

Using AXI4-Stream-like handshake interfaces (ready and valid to qualify data transactions) is very common in FPGA designs. It enables backpressure: the slave, i.e. the receiver of data, can indicate when it is ready to receive data.

Below are some rules governing how these handshake signals interact. They are adapted from the AMBA 4 AXI4-Stream Protocol Specification, ARM IHI 0051A (ID030610).

  1. A transaction occurs on the positive edge of the clock when both ready and valid are high. The graph below shows some typical transactions.

    _images/wavedrom-796e9942-546f-49fb-bb0f-b1c29c7dcbf2.svg

  2. The ready signal may fall without a transaction having occurred:

    _images/wavedrom-f056dd8a-3808-4d3c-b17f-ab51e5141888.svg

  3. The valid signal may NOT fall without a transaction having occurred:

    _images/wavedrom-e1083666-9ddc-46cd-ad06-0b03bce777dd.svg

  4. Once valid is asserted, the associated data may NOT be changed unless a transaction has occurred.

    _images/wavedrom-2662dc3c-bcef-45ba-8493-88fe061e9b20.svg

    This applies to any auxiliary signals associated with the bus as well, e.g. a last indicator.

    Note also that this restriction on data not changing only applies when valid is asserted. When it is not, the data may be changed freely.

  5. In order to avoid deadlock situations, the master may NOT wait for the slave to assert ready before asserting valid. The slave, however, may wait for valid before asserting ready.
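The rules on valid and on data stability can be checked mechanically on a trace of signal values sampled at each rising clock edge. The following Python sketch is a hypothetical checker written for illustration only (the name check_handshake and the trace format are assumptions). It does not cover the deadlock rule, since that rule concerns how the master is designed rather than what a single trace shows.

```python
from typing import List, Optional, Tuple

# One entry per rising clock edge: (ready, valid, data).
Cycle = Tuple[int, int, Optional[int]]


def check_handshake(cycles: List[Cycle]) -> Tuple[List[Optional[int]], List[str]]:
    """Return (transferred data words, rule violations) for a sampled trace."""
    transferred = []
    violations = []
    previous = None
    for index, (ready, valid, data) in enumerate(cycles):
        if previous is not None:
            prev_ready, prev_valid, prev_data = previous
            # valid was high in the previous cycle but no transaction occurred.
            pending = bool(prev_valid) and not prev_ready
            if pending and not valid:
                violations.append(f"cycle {index}: valid fell without a transaction")
            elif pending and data != prev_data:
                violations.append(f"cycle {index}: data changed while valid was pending")
        # A transaction occurs when both ready and valid are high.
        if ready and valid:
            transferred.append(data)
        previous = (ready, valid, data)
    return transferred, violations


# ready falls in cycle 4 without a transaction, which is allowed.
good_trace = [(0, 0, None), (0, 1, 10), (1, 1, 10), (1, 0, None), (0, 0, None), (1, 1, 11)]
words, errors = check_handshake(good_trace)  # words == [10, 11], errors == []
```

A trace such as [(0, 1, 10), (0, 0, None)] is reported as a violation, because valid is deasserted before the word has been accepted.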

Data sizing

It is up to the user to set the width such that the desired resource usage is achieved. If the goal is to infer one DSP48E2, the width of the accumulated data must be at most 48 bits.

When not upsampling, the accumulator width is calculated as:

\[\text{accumulator width} = \text{input_data_width} + \left\lceil\log_2 \left(\sum_{n=0}^{N-1}|\text{coefficient}_n| \right) \right\rceil\]

where \(N\) is the number of coefficients. This takes into account that the size grows depending on the values of the coefficients, not just the number of coefficients.
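The formula can be evaluated with exact integer arithmetic. The sketch below is a Python re-statement of the formula for a single coefficient set without upsampling. It is not the VHDL calc_accumulator_width function, and the name accumulator_width is chosen here for illustration only.

```python
def accumulator_width(input_data_width, coefficients):
    """Evaluate input_data_width + ceil(log2(sum of |coefficient|))."""
    coefficient_sum = sum(abs(coefficient) for coefficient in coefficients)
    if coefficient_sum < 1:
        raise ValueError("At least one coefficient must be non-zero")
    # For a positive integer s, (s - 1).bit_length() equals ceil(log2(s)).
    # This avoids floating point rounding for large coefficient sums.
    growth = (coefficient_sum - 1).bit_length()
    return input_data_width + growth


# Sum of magnitudes is 9, so the growth is ceil(log2(9)) = 4 bits.
small = accumulator_width(25, [1, -2, 3, -2, 1])  # 29
# Sum of magnitudes is 2**23, which exactly fills a 48-bit accumulator.
full = accumulator_width(25, [2**22, -(2**22)])  # 48
```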

The filter uses this value as the default for result_width, but the user may override it. If a smaller result_width is set, the least significant bits are truncated. The default value can be obtained from the calc_accumulator_width function in the slow_fir_filter_pkg package.
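Truncating the least significant bits corresponds to an arithmetic right shift of the accumulated value. The Python sketch below illustrates this on plain integers. The function truncate_lsbs is hypothetical, and the padding case (a result wider than the accumulator) is deliberately left out because the text does not say how padding is done.

```python
def truncate_lsbs(accumulated, accumulator_width, result_width):
    """Drop the least significant bits so the value fits in result_width bits."""
    if result_width > accumulator_width:
        raise ValueError("Padding to a wider result is not modelled in this sketch")
    # Python's >> on int is an arithmetic shift, so the sign is preserved.
    return accumulated >> (accumulator_width - result_width)


positive = truncate_lsbs(1000, accumulator_width=12, result_width=8)  # 62
negative = truncate_lsbs(-1000, accumulator_width=12, result_width=8)  # -63
```

Note that the shift rounds towards negative infinity, which is why -1000 becomes -63 rather than -62.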

If upsampling is performed, only some of the coefficients will be used each pass, and calc_accumulator_width takes this into account.
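One plausible way to account for upsampling is to look at each polyphase branch separately and size the accumulator for the branch with the largest sum of coefficient magnitudes. The Python sketch below works under that assumption. Both the branch selection (every upsampling-th coefficient) and the function name are assumptions and may differ from what calc_accumulator_width actually does.

```python
def accumulator_width_upsampling(input_data_width, coefficients, upsampling):
    """Accumulator width when each output only uses one polyphase branch."""
    # Assumption: output phase p uses coefficients p, p + upsampling, ...
    phase_sums = [
        sum(abs(coefficient) for coefficient in coefficients[phase::upsampling])
        for phase in range(upsampling)
    ]
    worst_case_sum = max(phase_sums)
    if worst_case_sum < 1:
        raise ValueError("At least one coefficient must be non-zero")
    # (s - 1).bit_length() equals ceil(log2(s)) for a positive integer s.
    return input_data_width + (worst_case_sum - 1).bit_length()


coefficients = [1, 2, 3, 4, 5, 6, 7, 8]
# Branch sums are 16 and 20, so the growth is ceil(log2(20)) = 5 bits.
with_upsampling = accumulator_width_upsampling(16, coefficients, upsampling=2)  # 21
# With upsampling = 1 all coefficients are used: ceil(log2(36)) = 6 bits.
without_upsampling = accumulator_width_upsampling(16, coefficients, upsampling=1)  # 22
```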

Throughput

The filter uses only one DSP to serve all channels, and most processing cycles are spent calculating the filter taps. There are also a few cycles of overhead before a calculation starts.

No resampling

When no up- or downsampling is performed, the number of cycles needed to calculate one output is:

\[\text{cycles per output} = \text{num_coefficients} + 4\]

This assumes that valid data is always available on all inputs, and that the outputs are always ready to accept data.
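As a worked example, the Python sketch below evaluates the formula and derives a rough upper bound on the output rate per channel. The per-channel bound assumes that the channels are served one after the other by the single DSP, which is an interpretation of the description above and not a figure stated by the module. The function names are hypothetical.

```python
OVERHEAD_CYCLES = 4  # Fixed overhead per output, from the formula above.


def cycles_per_output(num_coefficients):
    """Clock cycles needed to calculate one output without resampling."""
    return num_coefficients + OVERHEAD_CYCLES


def max_output_rate_hz(clock_hz, num_channels, num_coefficients):
    """Rough upper bound on the output rate of each channel.

    Assumption: all channels take turns on the single DSP, and inputs are
    always valid and outputs always ready.
    """
    return clock_hz / (num_channels * cycles_per_output(num_coefficients))


cycles = cycles_per_output(255)  # 259
# Four channels at a 100 MHz clock: roughly 96.5 kHz per channel.
rate = max_output_rate_hz(100e6, num_channels=4, num_coefficients=255)
```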

Upsampling

When upsampling, each input sample results in upsampling outputs, i.e. as many outputs as the upsampling factor. These outputs are calculated directly when the input arrives, before proceeding to the next channel.

The number of cycles needed to calculate one output is:

\[\text{cycles per output} = \left \lceil{\text{num_coefficients} / \text{upsampling}} \right \rceil + 4\]
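The formula can be evaluated with integer ceiling division, as in the Python sketch below. The second function, which multiplies by the upsampling factor to get the cycles spent per input sample, follows from each input producing upsampling outputs. The function names are hypothetical.

```python
OVERHEAD_CYCLES = 4  # Fixed overhead per output, from the formula above.


def cycles_per_output_upsampling(num_coefficients, upsampling):
    """Clock cycles needed to calculate one output when upsampling."""
    # -(-a // b) is integer ceiling division.
    coefficients_per_output = -(-num_coefficients // upsampling)
    return coefficients_per_output + OVERHEAD_CYCLES


def cycles_per_input_upsampling(num_coefficients, upsampling):
    """Clock cycles spent on one input sample, which yields `upsampling` outputs."""
    return upsampling * cycles_per_output_upsampling(num_coefficients, upsampling)


per_output = cycles_per_output_upsampling(255, upsampling=4)  # ceil(255 / 4) + 4 = 68
per_input = cycles_per_input_upsampling(255, upsampling=4)  # 4 * 68 = 272
```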

Downsampling

When downsampling, only every downsampling-th input results in an output. The number of cycles needed to calculate one output is the same as when no resampling is performed:

\[\text{cycles per output} = \text{num_coefficients} + 4\]

Inputs that don’t result in an output still need to be stored, which consumes four cycles.
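Combining the two statements above gives an estimate of the cycles consumed per output period of one channel. The Python sketch below assumes that each input that is only stored costs four cycles, and that exactly downsampling - 1 such inputs precede every output. This is a reading of the text, not a figure stated by the module, and the function name is hypothetical.

```python
OVERHEAD_CYCLES = 4  # Fixed overhead per calculated output.
STORE_CYCLES = 4  # Assumed cost of each input that is stored but gives no output.


def cycles_per_output_period(num_coefficients, downsampling):
    """Estimated cycles used by one channel for `downsampling` consecutive inputs."""
    stored_only_inputs = downsampling - 1
    return stored_only_inputs * STORE_CYCLES + num_coefficients + OVERHEAD_CYCLES


# Three stored inputs (12 cycles) plus one calculated output (259 cycles).
period = cycles_per_output_period(255, downsampling=4)  # 271
```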

Design details

Below follows a description of the different sub-modules.

macc.vhd

component macc is
  generic (
    num_accumulations : positive;
    data_a_width : positive;
    data_b_width : positive;
    accumulator_width : positive;
    result_width : positive
  );
  port (
    clk : in std_logic;
    --# {{}}
    data_valid : in std_logic;
    data_a : in signed;
    data_b : in signed;
    --# {{}}
    result_valid : out std_logic;
    result : out signed
  );
end component;

Multiply-accumulate (MACC) unit that automatically clears the accumulated result after a constant number of accumulations. The inputs are two signed numbers of any width, and the result width is configurable.

  • result_width sets the output width. Truncation is performed if this is smaller than the accumulator width.

  • num_accumulations sets the number of accumulations the MACC performs before resetting to 0.
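The behaviour described above can be modelled in a few lines of Python. This is a hypothetical behavioural sketch, not a cycle-accurate model of macc.vhd: pipeline latency is ignored, and truncation is assumed to drop the least significant bits.

```python
class MaccModel:
    """Behavioural model of a multiply-accumulate unit that clears itself."""

    def __init__(self, num_accumulations, accumulator_width, result_width):
        self.num_accumulations = num_accumulations
        self.accumulator_width = accumulator_width
        self.result_width = result_width
        self._accumulator = 0
        self._count = 0

    def push(self, data_a, data_b):
        """Feed one valid input pair. Returns the result on the last accumulation, else None."""
        self._accumulator += data_a * data_b
        self._count += 1
        if self._count < self.num_accumulations:
            return None
        result = self._accumulator
        # Clear automatically so that the next accumulation starts from 0.
        self._accumulator = 0
        self._count = 0
        # Assumption: truncation drops the least significant bits.
        shift = max(self.accumulator_width - self.result_width, 0)
        return result >> shift


macc = MaccModel(num_accumulations=3, accumulator_width=16, result_width=16)
outputs = [macc.push(1, 2), macc.push(3, 4), macc.push(5, 6)]  # [None, None, 44]
after_clear = macc.push(1, 1)  # None, since a new accumulation has started
```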

Resource utilization

This entity has netlist builds set up with automatic size checkers in module_slow_fir_filter.py. The following table lists the resource utilization for the entity, depending on generic configuration.

Resource utilization for macc.vhd netlist builds. All builds use data_a_width = 25 and data_b_width = 18.

num_accumulations  result_width  DSP Blocks  Total LUTs  FFs
1                  43            1           < 3         < 5
1                  47            1           < 3         < 5
1                  39            1           < 3         < 5
7                  46            1           < 3         < 5
7                  50            1           < 3         < 5
7                  42            1           < 3         < 5

slow_fir_filter.vhd

component slow_fir_filter is
  generic (
    num_channels : positive;
    input_data_width : positive;
    num_coefficient_sets : positive;
    num_coefficients_per_set : positive;
    -- All coefficients in ascending order.
    coefficients : integer_vector;
    upsampling : positive;
    downsampling : positive;
    result_width : positive := calc_accumulator_width(
      input_data_width=>input_data_width,
      coefficients=>coefficients,
      num_coefficient_sets=>num_coefficient_sets,
      upsampling=>upsampling)
  );
  port (
    clk : in std_logic;
    --# {{}}
    -- When using only one set, this one can be ignored.
    coefficient_set_select : in integer_vector;
    --# {{}}
    input_ready : out std_logic_vector;
    input_valid : in std_logic_vector;
    -- Using integer_vector since there is currently a bug in ghdl related to unconstrained arrays.
    -- The integers shall use only the signed range of input_data_width bits.
    input_data : in integer_vector;
    --# {{}}
    result_ready : in std_logic_vector;
    result_valid : out std_logic_vector;
    result_data : out signed
  );
end component;

Top level for the slow FIR filter.

The input_data port uses integer_vector instead of an array of type unsigned_vector(0 to num_channels - 1)(input_data_width - 1 downto 0), which would be more suitable. This is due to a bug in GHDL related to unconstrained arrays in VHDL-2008. For the same reason, the coefficients generic is a one-dimensional vector instead of a matrix. See e.g. https://github.com/ghdl/ghdl/issues/1224
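Since input_data carries plain integers, the restriction to the signed range of input_data_width bits has to be respected by whatever drives the port. The Python sketch below shows the range in question, for example for use when generating test stimuli. The helper names are hypothetical.

```python
def signed_range(width):
    """Return (lowest, highest) value of a two's complement number of `width` bits."""
    return -(2 ** (width - 1)), 2 ** (width - 1) - 1


def out_of_range_samples(samples, input_data_width):
    """Return the samples that do not fit in the signed range of input_data_width bits."""
    lowest, highest = signed_range(input_data_width)
    return [sample for sample in samples if not lowest <= sample <= highest]


limits = signed_range(25)  # (-16777216, 16777215)
rejected = out_of_range_samples([0, -16777216, 16777215, 16777216], input_data_width=25)
```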

Resource utilization

This entity has netlist builds set up with automatic size checkers in module_slow_fir_filter.py. The following table lists the resource utilization for the entity, depending on generic configuration.

Resource utilization for slow_fir_filter.vhd netlist builds. All builds use input_data_width = 25, num_coefficients_per_set = 255 and result_width = 48, and are built using the wrapper slow_fir_filter_netlist_build_wrapper.vhd.

num_channels  upsampling  downsampling  DSP Blocks  Total LUTs  FFs    RAMB36  RAMB18
1             1           1             1           < 150       < 100  0       1
1             4           1             1           < 150       < 100  0       1
1             1           4             1           < 150       < 100  0       1
4             1           1             1           < 200       < 100  1       0
4             4           1             1           < 200       < 100  0       1
4             1           4             1           < 200       < 100  1       0