An Approximate Closed-Form Expression for Calculating Performance of Floating-Point Format for the Laplacian Source

Perić, Zoran; Denić, Bojan; Dinčić, Milan; Perić, Sofija

doi:10.15388/25-INFOR587

Informatica

Volume 36, Issue 1 (2025), pp. 125–140

https://doi.org/10.15388/25-INFOR587

Pub. online: 18 March 2025 Type: Research Article

Open Access

Received: 1 October 2024

Accepted: 1 March 2025

Published: 18 March 2025

Abstract

This paper introduces a novel approach that bridges the floating-point (FP) format, widely utilized in diverse fields for data representation, with the μ-law companding quantizer, proposing a method for designing and linearizing the μ-law companding quantizer to yield a piecewise uniform quantizer tailored to the FP format. A key outcome of the paper is a closed-form approximate expression for closely and efficiently evaluating the FP format’s performance for data with the Laplacian distribution. This expression offers generality across various bit rates and data variances, markedly reducing the computational complexity of FP performance evaluation compared to prior methods reliant on summation of a large number of terms. By facilitating the evaluation of FP format performance, this research substantially aids in the selection of the optimal bit rates, crucial for digital representation quality, dynamic range, computational overhead, and energy efficiency. The numerical calculations spanning a wide range of data variances provided for some commonly used FP versions with an 8-bit exponent have demonstrated that the proposed closed-form expression closely approximates FP format performance.

1 Introduction

The floating-point (FP) format is extensively employed for data representation across various domains, including computing (Fasi and Mikaitis, 2021; Burgess et al., 2019a), neural networks (Zhao et al., 2023; Bai-Kui and Shanq-Jang, 2023), and signal processing (Moroz and Samotyy, 2019). The prevalent 32-bit floating-point (FP32) format adheres to standardized specifications (IEEE 754, 2019), boasting exceptional digital representation quality across a very wide range of data variance, ranging from minuscule to substantial values. However, the FP32 format’s computational intensity poses a challenge for implementation on hardware-constrained devices (Yang et al., 2022; Syed et al., 2021; Cattaneo et al., 2018). The 24-bit FP (FP24) (Junaid et al., 2022), 16-bit FP (Bfloat16 and DLFloat) (Burgess et al., 2019b; Agrawal et al., 2019), and 8-bit (FP8) (Wang et al., 2018) formats are examples of lower-bit FP formats that reduce computational complexity and energy consumption, making them advantageous for hardware and energy-restricted systems. Conversely, formats such as 64-bit FP (FP64) (IEEE 754, 2019) are utilized in environments necessitating heightened calculation precision (Botta et al., 2021).

As evident, there exists a plethora of FP formats, each with varying bit rates, offering distinct qualities in digital representation, dynamic range, computational complexity, and energy usage. Selecting the optimal FP format is a crucial task in both research and practical applications, contingent upon several factors: the required digital representation accuracy for a specific application, the range of data variance, as well as the available hardware and energy resources. Generally, it is preferable to opt for an FP format with fewer bits to minimize hardware demands and energy consumption while ensuring the requisite level of representation accuracy for a specific application across the entire range of data variance. Achieving this necessitates an efficient mechanism for evaluating the performance of FP formats across different bit rates and data variance levels.

It is worth noting that none of the aforementioned papers dealing with the FP formats (Junaid et al., 2022; Agrawal et al., 2019; Burgess et al., 2019a; Wang et al., 2018; Botta et al., 2021) provide information regarding their actual performance, which is a critical factor for practical applications. A significant stride in this direction was made in Perić et al. (2021), Denić et al. (2023), where the correlation was established between the FP format and a piecewise uniform quantizer, which was termed the floating-point quantizer (FPQ). Namely, the piecewise uniform quantizer includes a number of segments, where a unique uniform quantizer is defined in each segment (Dinčić et al., 2016; Jayant and Noll, 1984; Gersho and Gray, 1992). This actually allowed assessing the digital representation quality of the FP format using an objective performance measure such as the signal-to-quantization-noise ratio (SQNR) of FPQ. It’s crucial to acknowledge that the performance of the FP format, specifically the SQNR of FPQ, relies heavily on the statistical properties of the data, primarily the probability density function (PDF). This paper considers the Laplacian PDF, given its extensive usage in statistically modelling various data types, e.g., speech (Chu, 2003; Gazor and Zhang, 2003) and neural network weights (Banner et al., 2018; 2019).

The primary goal of this paper is to make a significant advancement towards the FP format analysis by providing a performance-evaluating method that is more efficient (in terms of computational complexity) compared to the previously developed method (Perić et al., 2021; Denić et al., 2023). This is achieved by linking the FP format with the μ-law companding quantizer (μCQ), which is actually a novel concept, as research on this topic has not been done before. Namely, the SQNR expression for FPQ in Perić et al. (2021), Denić et al. (2023) is not provided in a closed form but as the summation of numerous terms (e.g., for the FP32 format, this sum comprises 254 terms, Perić et al., 2021), thereby escalating the complexity of FP format performance computation. Hence, the paper aims to eliminate the mentioned drawback of the existing method for assessing the performance of the FP format. A significant contribution is the development of a procedure for designing a μCQ, tailoring its key parameters (μ-compression factor and ${x_{\max }}$ – maximal amplitude) to the FP format. The key outcome of this innovative approach is the provision of a simple closed-form approximate expression for closely and efficiently assessing the FP format’s performance. The advantage of this closed-form expression is its broad applicability, as it applies universally to any bit rate and data variance. Aside from the theoretical significance of deriving a closed-form expression for performance evaluation, this paper holds substantial practical value by considerably simplifying the complexity of computing FP format performance.

The paper’s methodology involves designing an appropriate μCQ, linearizing it, and deriving a piecewise uniform quantizer based on the μ-law compression function (PUQ^μ). The paper demonstrates that by selecting the appropriate values of the crucial design parameters of the μCQ, the structure of its linearized version, PUQ^μ, aligns with the FPQ structure. Notably, the paper provides a closed-form expression for the SQNR of μCQ for the Laplacian PDF, obtained by simplifying the general SQNR expression for μCQ provided in Perić et al. (2010). The accuracy of the derived closed-form SQNR expression for μCQ is examined for two versions of the FP format with an 8-bit exponent, FP24 and FP32, over a very wide dynamic range of input data variances. It is shown that the proposed SQNR expression estimates FP performance accurately when compared with the existing approach (Perić et al., 2021; Denić et al., 2023): the SQNR calculation error stays below 1%, the level taken as reasonable accuracy for an SQNR formula (Na, 2011). Thus, utilizing the proposed approach instead of the previously introduced one based on a summation of numerous terms ensures a high level of accuracy and leads to a noteworthy reduction in computational complexity.

The rest of the paper is organized as follows. In Section 2, the description of the R-bit FP format is provided, and its connection with the piecewise uniform quantization is explained. The main result is exposed in Section 3, which performs the design of the μCQ along with its linearized version tailored to the FP format and provides the closed-form expression for estimating FP format performance. Section 4 presents simulation results and highlights the benefits of the approach studied in the paper. Section 5 gives concluding remarks.

2 Description of the Floating-Point Format

A real number x is encoded in the R-bit FP format as (IEEE 754, 2019):

(1)

\[ x={(s{a_{e-1}}\dots {a_{1}}{a_{0}}{b_{m-1}}\dots {b_{1}}{b_{0}})_{2}},\]

consisting of one bit s to indicate the sign, e bits (${a_{e-1}}\dots {a_{1}}{a_{0}}$) to represent the exponent E, and m bits (${b_{m-1}}\dots {b_{1}}{b_{0}}$) to represent the significand M of the number x, where $R=e+m+1$. The exponent $E={\textstyle\sum _{i=0}^{e-1}}{a_{i}}{2^{i}}$ can take values from 0 to ${2^{e}}-1$, but the values $E=0$ and $E={2^{e}}-1$ are reserved according to IEEE 754 (2019), leaving ${L^{\textit{FP}}}={2^{e}}-2$ values of E (from 1 to ${2^{e}}-2$) that can be used to represent numbers. The parameter $M={\textstyle\sum _{i=1}^{m}}{b_{m-i}}{2^{m-i}}$ can take values from 0 to ${2^{m}}-1$. The number x, represented with (1), can be calculated in its decimal form as (IEEE 754, 2019):

(2)

\[ x={(-1)^{s}}{2^{{E^{\ast }}}}\bigg(1+\frac{M}{{2^{m}}}\bigg),\]

where ${E^{\ast }}=E-\textit{bias}$ denotes the biased exponent and $\textit{bias}={L^{\textit{FP}}}/2$ is a predefined parameter. Therefore, the biased exponent ${E^{\ast }}$ takes values from ${E_{\min }^{\ast }}=1-{L^{\textit{FP}}}/2$ to ${E_{\max }^{\ast }}={L^{\textit{FP}}}/2$. For example, for FP32 we have $e=8$ and $m=23$ (IEEE 754, 2019), while for FP24 we have $e=8$ and $m=15$. Due to the same e value, both FP32 and FP24 formats have identical values for the following parameters: ${L^{\textit{FP}}}=254$, $\textit{bias}=127$, ${E_{\min }^{\ast }}=-126$, and ${E_{\max }^{\ast }}=127$.
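The parameters above and the evaluation of (2) can be sketched in a few lines of Python. This is only an illustration under the conventions of this section; the function names `fp_params` and `decode` are hypothetical, and reserved exponent values are simply rejected rather than handled.

```python
def fp_params(e):
    """Return (L_FP, bias, E*_min, E*_max) for an e-bit exponent field."""
    L_fp = 2**e - 2          # usable exponent values: E = 1 .. 2^e - 2
    bias = L_fp // 2
    return L_fp, bias, 1 - bias, L_fp - bias


def decode(s, E, M, e, m):
    """Evaluate Eq. (2) for sign bit s, exponent field E and significand field M."""
    L_fp, bias, _, _ = fp_params(e)
    if not 1 <= E <= L_fp:
        raise ValueError("E = 0 and E = 2^e - 1 are reserved")
    if not 0 <= M < 2**m:
        raise ValueError("M must fit in m bits")
    return (-1) ** s * 2.0 ** (E - bias) * (1 + M / 2**m)


print(fp_params(8))                      # parameters shared by FP24 and FP32
print(decode(0, 127, 0, e=8, m=23))      # E* = 0, M = 0
print(decode(1, 128, 2**22, e=8, m=23))  # E* = 1, M = 2^(m-1), negative sign
```

For $e=8$ the first call reproduces the values ${L^{\textit{FP}}}=254$, $\textit{bias}=127$, ${E_{\min }^{\ast }}=-126$, and ${E_{\max }^{\ast }}=127$ quoted above.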

The R-bit FP format exhibits symmetry around 0, as every positive number in the format corresponds to a symmetric negative number. Let’s examine positive numbers within the R-bit FP format, without losing generality. The maximum positive number representable in this format (for ${E^{\ast }}={E_{\max }^{\ast }}$ and $M={2^{m}}-1$) is:

(3)

\[ {x_{\max }^{\textit{FP}}}={2^{{E_{\max }^{\ast }}}}\bigg(1+\frac{{2^{m}}-1}{{2^{m}}}\bigg)={2^{{E_{\max }^{\ast }}}}\bigg(2-\frac{1}{{2^{m}}}\bigg)\approx {2^{{E_{\max }^{\ast }}+1}}={2^{{L^{\textit{FP}}}/2+1}}.\]

For each value of ${E^{\ast }}$ (${E_{\min }^{\ast }}\leqslant {E^{\ast }}\leqslant {E_{\max }^{\ast }}$) we define a segment ${S_{{E^{\ast }}}}=[{2^{{E^{\ast }}}},{2^{{E^{\ast }}+1}}$) of width ${\delta _{{E^{\ast }}}}={2^{{E^{\ast }}}}$, which includes ${2^{m}}$ equidistant real numbers ${2^{{E^{\ast }}}}(1+\frac{M}{{2^{m}}})$, $M=0,\dots ,{2^{m}}-1$, placed at a mutual distance ${\Delta _{{E^{\ast }}}}={2^{{E^{\ast }}}}(1+\frac{M+1}{{2^{m}}})-{2^{{E^{\ast }}}}(1+\frac{M}{{2^{m}}})={2^{{E^{\ast }}-m}}$. Hence, in the positive part of the real axis, there are a total of ${L^{\textit{FP}}}$ segments ${S_{{E^{\ast }}}}$, each containing ${2^{m}}$ equidistant numbers with a step size of ${\Delta _{{E^{\ast }}}}$. Due to symmetry, the same structure of ${L^{\textit{FP}}}$ segments with ${2^{m}}$ equidistant numbers also exists in the negative part of the real axis. Since

(4)

\[ {\delta _{{E^{\ast }}+1}}={2^{{E^{\ast }}+1}}=2\cdot {2^{{E^{\ast }}}}=2{\delta _{{E^{\ast }}}}\]

and

(5)

\[ {\Delta _{{E^{\ast }}+1}}={2^{{E^{\ast }}+1-m}}=2\cdot {2^{{E^{\ast }}-m}}=2{\Delta _{{E^{\ast }}}},\]

it can be concluded that the width of segment ${S_{{E^{\ast }}+1}}$ is twice as large as the width of segment ${S_{{E^{\ast }}}}$, and the distance between adjacent numbers in ${S_{{E^{\ast }}+1}}$ is twice as high as in ${S_{{E^{\ast }}}}$. Therefore, as the value of ${E^{\ast }}$ increases, the distance between adjacent numbers increases, meaning that the FP format provides a finer representation of smaller numbers.
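The step-size relations (4) and (5) can be checked on any machine, since Python's built-in `float` is the FP64 format (with $m=52$). The sketch below assumes Python 3.9 or newer for `math.ulp` and `math.nextafter`; the helper `step_size` is a hypothetical name introduced only for this illustration.

```python
import math

M_BITS = 52  # significand bits of binary64, the format behind Python's float


def step_size(E_star, m):
    """Spacing of adjacent numbers in segment S_{E*}: Delta = 2^(E* - m)."""
    return 2.0 ** (E_star - m)


for E_star in (-10, 0, 1, 20):
    low = 2.0 ** E_star                              # left edge of the segment
    last = math.nextafter(2.0 ** (E_star + 1), 0.0)  # last number inside it
    # The spacing is the same at both ends of the segment ...
    assert math.ulp(low) == step_size(E_star, M_BITS)
    assert math.ulp(last) == step_size(E_star, M_BITS)
    # ... and doubles when moving to the next segment, as in Eq. (5).
    assert math.ulp(2.0 * low) == 2.0 * math.ulp(low)
```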

The described structure of the FP format fully corresponds to the structure of a symmetric piecewise uniform quantizer with a maximum amplitude ${x_{\max }^{\textit{FP}}}$ defined with (3), which in the positive part has ${L^{\textit{FP}}}$ segments ${S_{{E^{\ast }}}}=[{2^{{E^{\ast }}}},{2^{{E^{\ast }}+1}}$), ${E_{\min }^{\ast }}\leqslant {E^{\ast }}\leqslant {E_{\max }^{\ast }}$, each segment undergoing uniform quantization with ${2^{m}}$ quantization levels and with the step size ${\Delta _{{E^{\ast }}}}={2^{{E^{\ast }}-m}}={2^{{E^{\ast }}-R+e+1}}$. This model of quantizer, whose structure mirrors that of the FP format, is known as the floating-point quantizer (FPQ) (Perić et al., 2021; Denić et al., 2023). This analogy between the FP format and the FPQ is significant, enabling the FP representation quality to be assessed using an objective measure such as the SQNR of the FPQ. SQNR is generally defined as (Jayant and Noll, 1984; Chu, 2003; Gersho and Gray, 1992):

(6)

\[ \mathrm{SQNR}(\sigma )=10{\log _{10}}\frac{{\sigma ^{2}}}{D(\sigma )},\]

where ${\sigma ^{2}}$ represents the variance of the data to be quantized and $D(\sigma )$ is the distortion, i.e. the error introduced by quantization. In the case of FPQ, ${\sigma ^{2}}$ represents the variance of the data to be represented in the FP format, while the distortion of FPQ represents the error introduced by the FP representation of real numbers and can be expressed in general form as (Perić et al., 2021; Denić et al., 2023):

(7)

\[ {D^{\mathrm{FPQ}}}(\sigma )=\underset{{D_{g}^{\mathrm{FPQ}}}(\sigma )}{\underbrace{2{\sum \limits_{{E^{\ast }}={E_{\min }^{\ast }}}^{{E_{\max }^{\ast }}}}\frac{{\Delta _{{E^{\ast }}}^{2}}}{12}{P_{{E^{\ast }}}}(\sigma )}}+\underset{{D_{ov}^{\mathrm{FPQ}}}(\sigma )}{\underbrace{2{\int _{{x_{\max }^{\textit{FP}}}}^{+\infty }}{\big(x-{x_{\max }^{\textit{FP}}}\big)^{2}}p(x,\sigma )dx}}.\]

Multiplication by 2 in the expression (7) is used to account for the distortion in the negative part of the real axis. The first term in (7), expressed as a sum, represents the granular distortion ${D_{g}^{\mathrm{FPQ}}}$ in ${L^{\textit{FP}}}$ segments ${S_{{E^{\ast }}}}$ (${E_{\min }^{\ast }}\leqslant {E^{\ast }}\leqslant {E_{\max }^{\ast }}$), where ${P_{{E^{\ast }}}}(\sigma )={\textstyle\int _{{2^{{E^{\ast }}}}}^{{2^{({E^{\ast }}+1)}}}}\hspace{-0.1667em}\hspace{-0.1667em}p(x,\sigma )dx$ represents the probability that the real number x belongs to segment ${S_{{E^{\ast }}}}$, with $p(x,\sigma )$ representing the PDF of the input data. The second term in (7) represents the overload distortion ${D_{ov}^{\mathrm{FPQ}}}$ that occurs during quantization of numbers outside the support region of the FPQ.

This paper examines the zero-mean Laplacian PDF of variance ${\sigma ^{2}}$, defined as (Jayant and Noll, 1984; Gersho and Gray, 1992):

(8)

\[ p(x,\sigma )=\frac{1}{\sqrt{2}\sigma }\exp \bigg(-\frac{\sqrt{2}|x|}{\sigma }\bigg).\]

For $p(x,\sigma )$ defined with (8), based on (3), (6), and (7), the following SQNR expression for the FPQ quantizer is obtained:

(9)

\[\begin{aligned}{}{\mathrm{SQNR}^{\mathrm{FPQ}}}(\sigma )& =-10{\log _{10}}\bigg[{\sum \limits_{{E^{\ast }}=1-{L^{\textit{FP}}}/2}^{{L^{\textit{FP}}}/2}}\frac{{2^{2({E^{\ast }}-R+e)}}}{3{\sigma ^{2}}}\bigg(\exp \bigg(-\frac{{2^{{E^{\ast }}+1/2}}}{\sigma }\bigg)\\ {} & \hspace{1em}-\exp \bigg(-\frac{{2^{{E^{\ast }}+3/2}}}{\sigma }\bigg)\bigg)+\exp \bigg(-\frac{{2^{({L^{\textit{FP}}}+3)/2}}}{\sigma }\bigg)\bigg].\end{aligned}\]

Using (9), it is possible to compute the performance of the R-bit FP format for any value of data variance. However, expression (9) contains the sum of ${L^{\textit{FP}}}$ elements, being computationally demanding since ${L^{\textit{FP}}}$ is typically a large number (Perić et al., 2021; Denić et al., 2023). This issue will be solved in the next section, where an approximate closed-form expression is supplied for efficiently calculating the performance of the R-bit FP format.
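To make the cost of (9) concrete, the following sketch evaluates it term by term. It is an illustrative transcription, not the authors' implementation: the function name `sqnr_fpq_db` is hypothetical, and the use of `math.expm1` for the difference of exponentials is a numerical choice made here to avoid cancellation when the arguments are tiny.

```python
import math


def sqnr_fpq_db(sigma, R, e=8):
    """SQNR of the FPQ for a Laplacian source, Eq. (9), in dB."""
    L_fp = 2**e - 2
    total = 0.0
    # One term per segment S_{E*}: the loop runs L_FP times (254 for e = 8).
    for E_star in range(1 - L_fp // 2, L_fp // 2 + 1):
        a = 2.0 ** (E_star + 0.5) / sigma
        # exp(-a) - exp(-2a), rewritten as -exp(-a) * expm1(-a)
        prob_term = -math.exp(-a) * math.expm1(-a)
        total += 2.0 ** (2 * (E_star - R + e)) / (3.0 * sigma**2) * prob_term
    # Overload term of Eq. (9)
    total += math.exp(-(2.0 ** ((L_fp + 3) / 2)) / sigma)
    return -10.0 * math.log10(total)


for R in (24, 32):
    print(R, round(sqnr_fpq_db(1.0, R), 2))
```

Each evaluation visits all ${L^{\textit{FP}}}$ segments, so sweeping a fine grid of variances repeats this loop for every grid point.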

A key outcome of this section is a simple closed-form SQNR expression for an appropriately designed μ-law companding quantizer (μCQ) that can be used as a very close performance approximation for FP formats, reducing the complexity associated with calculating the performance of FP formats explained in Section 2. In the following, we give the design of a μCQ in such a way that its linearization yields a piecewise uniform quantizer (PUQ^μ) whose structure closely resembles that of the FPQ. It will be shown that the performance of μCQ and PUQ^μ are very close, providing a basis for utilizing the derived SQNR formula of μCQ as a very good approximation of FP formats’ performance.

3.1 Design of a μ-Law Companding Quantizer Inspired by the FP Format

Companding quantizers are typically implemented as a cascade connection compressor–uniform quantizer–expander. For a symmetric μCQ, the compressor function ${c_{\mu }}(x):[-{x_{\max }},{x_{\max }}]\to [-{x_{\max }},{x_{\max }}]$ is defined as (Jayant and Noll, 1984; Gersho and Gray, 1992):

(10)

\[ {c_{\mu }}(x)=\frac{{x_{\max }}}{\ln (1+\mu )}\ln \bigg(1+\frac{\mu |x|}{{x_{\max }}}\bigg)\operatorname{sgn}(x),\hspace{1em}0\leqslant |x|\leqslant {x_{\max }},\]

where μ is a compression factor and ${x_{\max }}$ is the maximal amplitude of the quantizer. The decision thresholds ${x_{j}^{\mu }}$ and representational levels ${y_{j}^{\mu }}$ of the μCQ quantizer in the positive part of the real axis can be specified in the following way (Dinčić et al., 2021; Perić et al., 2010):

(11)

\[\begin{aligned}{}& {x_{j}^{\mu }}={c_{\mu }^{-1}}\bigg(2j\frac{{x_{\max }}}{N}\bigg)=\frac{{x_{\max }}}{\mu }\big({(1+\mu )^{\frac{2j}{N}}}-1\big),\hspace{1em}0\leqslant j\leqslant N/2,\end{aligned}\]

(12)

\[\begin{aligned}{}& {y_{j}^{\mu }}={c_{\mu }^{-1}}\bigg((2j-1)\frac{{x_{\max }}}{N}\bigg)=\frac{{x_{\max }}}{\mu }\big({(1+\mu )^{\frac{2j-1}{N}}}-1\big),\hspace{1em}1\leqslant j\leqslant N/2,\end{aligned}\]

where N denotes the number of representational levels, while ${\Delta _{u}}=2{x_{\max }}/N$ defines the step size of the uniform quantizer and ${c_{\mu }^{-1}}(x)$ is the inverse μ-law compression function that defines the expander. Note that the decision thresholds and representational levels of the μCQ depend on the parameters μ and ${x_{\max }}$, whose values will be selected based on the condition that PUQ^μ, as a linearized version of the μCQ, has the same structure as the FPQ.
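A minimal sketch of (10)–(12) follows. The function names are hypothetical, and the parameter values in the usage lines ($\mu =255$, ${x_{\max }}=1$, $N=16$) are arbitrary illustrative choices, not the FP-tailored values derived later in this section.

```python
import math


def compress(x, mu, x_max):
    """mu-law compressor c_mu(x), Eq. (10)."""
    y = x_max / math.log1p(mu) * math.log1p(mu * abs(x) / x_max)
    return math.copysign(y, x)


def expand(y, mu, x_max):
    """Inverse compressor (expander): (x_max / mu) * ((1 + mu)^(|y| / x_max) - 1)."""
    x = x_max / mu * math.expm1(abs(y) / x_max * math.log1p(mu))
    return math.copysign(x, y)


def thresholds(N, mu, x_max):
    """Decision thresholds x_j, 0 <= j <= N/2, Eq. (11)."""
    return [expand(2 * j * x_max / N, mu, x_max) for j in range(N // 2 + 1)]


def levels(N, mu, x_max):
    """Representational levels y_j, 1 <= j <= N/2, Eq. (12)."""
    return [expand((2 * j - 1) * x_max / N, mu, x_max) for j in range(1, N // 2 + 1)]


t = thresholds(16, 255.0, 1.0)
y = levels(16, 255.0, 1.0)
for j in range(1, len(t)):
    print(f"cell {j}: [{t[j - 1]:.5f}, {t[j]:.5f})  level {y[j - 1]:.5f}")
```

The printed cells widen geometrically with j, which is the property that the linearization below exploits.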

The next step is the piecewise linearization of the μCQ, achieved by approximating the compressor function ${c_{\mu }}(x)$ defined with (10) by a symmetric piecewise linear compressor function ${g_{\mu }}(x):[-{x_{\max }},{x_{\max }}]\to [-{x_{\max }},{x_{\max }}]$. Due to the symmetry of ${g_{\mu }}(x)$ around 0, we can consider only the positive part of the real axis where ${g_{\mu }}(x)$ is defined as:

(13)

\[ {g_{\mu }}(x)={a_{j}}x+{b_{j}},\hspace{1em}x\in \big[{x_{j-1}^{\textit{seg}}},{x_{j}^{\textit{seg}}}\big],\hspace{2.5pt}1\leqslant j\leqslant L,\]

where ${a_{j}}$ and ${b_{j}}$ are coefficients that will be determined later in this section, L is the number of linear segments in the positive part and ${x_{j}^{\textit{seg}}}$ ($0\leqslant j\leqslant L$) are the boundaries between segments, where ${x_{0}^{\textit{seg}}}=0$ and ${x_{L}^{\textit{seg}}}={x_{\max }}$. The function ${g_{\mu }}(x)$ must satisfy the condition of having the same values as the function ${c_{\mu }}(x)$ at the segments’ boundaries ${x_{j}^{\textit{seg}}}$:

(14)

\[ {g_{\mu }}\big({x_{j}^{\textit{seg}}}\big)={c_{\mu }}\big({x_{j}^{\textit{seg}}}\big)\equiv \frac{{x_{\max }}}{\ln (1+\mu )}\ln \bigg(1+\frac{\mu {x_{j}^{\textit{seg}}}}{{x_{\max }}}\bigg),\hspace{1em}0\leqslant j\leqslant L.\]

This yields a symmetric PUQ^μ with L linear segments in the positive part of the real axis, performing uniform quantization with K uniformly spaced quantization levels within each segment. To ensure that all segments $[{x_{j-1}^{\textit{seg}}},{x_{j}^{\textit{seg}}})$ $(1\leqslant j\leqslant L)$ contain the same number of quantization levels, the values of ${g_{\mu }}(x)$ at the segments’ boundaries ${x_{j}^{\textit{seg}}}$ $(0\leqslant j\leqslant L)$ must be equidistant within the range $[0,{x_{\max }}]$, i.e. it must hold that ${g_{\mu }}({x_{j}^{\textit{seg}}})-{g_{\mu }}({x_{j-1}^{\textit{seg}}})={x_{\max }}/L=\operatorname{const}$, $1\leqslant j\leqslant L$. This will be achieved if the following condition is fulfilled:

(15)

\[ {g_{\mu }}\big({x_{j}^{\textit{seg}}}\big)=j\frac{{x_{\max }}}{L},\hspace{1em}0\leqslant j\leqslant L.\]

From conditions (14) and (15) it follows:

(16)

\[ \frac{{x_{\max }}}{\ln (1+\mu )}\ln \bigg(1+\frac{\mu {x_{j}^{\textit{seg}}}}{{x_{\max }}}\bigg)=j\frac{{x_{\max }}}{L},\hspace{1em}0\leqslant j\leqslant L.\]

From here it is easy to obtain ${x_{j}^{\textit{seg}}}$:

(17)

\[ {x_{j}^{\textit{seg}}}=\frac{{x_{\max }}}{\mu }\big({(1+\mu )^{j/L}}-1\big),\hspace{1em}0\leqslant j\leqslant L,\]

which is also influenced by μ and ${x_{\max }}$. To ensure equivalence between PUQ^μ and FPQ, we will set the parameters of the considered PUQ^μ to be equal to the corresponding parameters of the FPQ:

(18)

\[ {x_{\max }}={x_{\max }^{\textit{FP}}},\hspace{1em}L={L^{\textit{FP}}},\hspace{1em}K={2^{m}}={2^{R-e-1}},\hspace{1em}N=2LK={2^{R}}\big(1-{2^{1-e}}\big),\]

but also it is necessary for the PUQ^μ to satisfy the condition (4) valid for the FPQ that the width of each segment is twice as large as the width of the previous one:

(19)

\[ {x_{j+1}^{\textit{seg}}}-{x_{j}^{\textit{seg}}}=2\big({x_{j}^{\textit{seg}}}-{x_{j-1}^{\textit{seg}}}\big),\hspace{1em}1\leqslant j\leqslant {L^{\textit{FP}}}-1,\]

which will be achieved by selecting an appropriate value for the parameter μ, as will be demonstrated in the next Theorem 1.

Theorem 1.

PUQ^μ with parameters defined by (18) will be equivalent to the FPQ if $\mu ={2^{{L^{\textit{FP}}}}}-1$.

Proof.

From (17) and (19), it follows:

(20)

\[ \frac{{x_{\max }^{\textit{FP}}}}{\mu }{(1+\mu )^{j/{L^{\textit{FP}}}}}\big({(1+\mu )^{1/{L^{\textit{FP}}}}}-1\big)=2\frac{{x_{\max }^{\textit{FP}}}}{\mu }{(1+\mu )^{(j-1)/{L^{\textit{FP}}}}}\big({(1+\mu )^{1/{L^{\textit{FP}}}}}-1\big),\]

where $1\leqslant j\leqslant {L^{\textit{FP}}}-1$, as in (19). From (20) we get that:

(21)

\[ {(1+\mu )^{j/{L^{\textit{FP}}}}}=2{(1+\mu )^{(j-1)/{L^{\textit{FP}}}}},\hspace{1em}1\leqslant j\leqslant {L^{\textit{FP}}}-1.\]

Based on (21), it is obvious that:

(22)

\[ {(1+\mu )^{1/{L^{\textit{FP}}}}}=2.\]

Finally, it follows that:

(23)

\[ \mu ={2^{{L^{\textit{FP}}}}}-1,\]

which concludes the proof. □
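Theorem 1 can be illustrated numerically with (17). By (17), the ratio of consecutive segment widths equals ${(1+\mu )^{1/L}}$ for any μ, so it is 2 only for $\mu ={2^{L}}-1$. The sketch below uses a small illustrative case ($e=4$, hence $L=14$, and ${x_{\max }}={2^{8}}$), chosen here so that the numbers stay readable; the function names are hypothetical.

```python
def segment_bounds(mu, L, x_max):
    """Segment boundaries x_j^seg, 0 <= j <= L, from Eq. (17)."""
    return [x_max / mu * ((1 + mu) ** (j / L) - 1) for j in range(L + 1)]


def width_ratios(bounds):
    """Ratios of consecutive segment widths; Eq. (19) requires all of them to be 2."""
    widths = [b - a for a, b in zip(bounds, bounds[1:])]
    return [w_next / w for w, w_next in zip(widths, widths[1:])]


L = 14            # L_FP for an assumed 4-bit exponent
x_max = 2.0 ** 8  # approximate maximal amplitude 2^(L/2 + 1)

print(width_ratios(segment_bounds(2**L - 1, L, x_max))[:3])  # mu from Theorem 1
print(width_ratios(segment_bounds(255, L, x_max))[:3])       # some other mu
```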

By establishing all crucial parameters, the design of the observed PUQ^μ, as well as μCQ (see (11) and (12)), is completed. Based on (17), (23), (18), and (3), we obtain the final expression for the segments’ boundaries of the PUQ^μ:

(24)

\[ {x_{j}^{\textit{seg}}}={x_{\max }^{\textit{FP}}}\frac{{2^{j}}-1}{{2^{{L^{\textit{FP}}}}}-1}={2^{{L^{\textit{FP}}}/2+1}}\frac{{2^{j}}-1}{{2^{{L^{\textit{FP}}}}}-1}\approx {2^{-{L^{\textit{FP}}}/2+1}}\big({2^{j}}-1\big),\hspace{1em}0\leqslant j\leqslant {L^{\textit{FP}}}.\]

The coefficients ${a_{j}}$ and ${b_{j}}$ ($1\leqslant j\leqslant {L^{\textit{FP}}}$) in (13) can be determined as:

(25)

\[\begin{aligned}{}& {a_{j}}=\frac{{g_{\mu }}({x_{j}^{\textit{seg}}})-{g_{\mu }}({x_{j-1}^{\textit{seg}}})}{{x_{j}^{\textit{seg}}}-{x_{j-1}^{\textit{seg}}}}=\frac{{x_{\max }^{\textit{FP}}}/{L^{\textit{FP}}}}{{x_{\max }^{\textit{FP}}}\frac{{2^{j-1}}}{{2^{{L^{\textit{FP}}}}}-1}}\approx \frac{{2^{{L^{\textit{FP}}}-j+1}}}{{L^{\textit{FP}}}},\end{aligned}\]

(26)

\[\begin{aligned}{}& {b_{j}}={g_{\mu }}\big({x_{j}^{\textit{seg}}}\big)-{a_{j}}{x_{j}^{\textit{seg}}}=\frac{{x_{\max }^{\textit{FP}}}}{{L^{\textit{FP}}}}\big(j-2+{2^{1-j}}\big)=\frac{{2^{{L^{\textit{FP}}}/2+1}}}{{L^{\textit{FP}}}}\big(j-2+{2^{1-j}}\big).\end{aligned}\]

By introducing the step size within the j-th segment of PUQ^μ:

(27)

\[ {\Delta _{j}}=\big({x_{j}^{\textit{seg}}}-{x_{j-1}^{\textit{seg}}}\big)\big/K\approx {2^{-{L^{\textit{FP}}}/2+j-R+e+1}},\hspace{1em}1\leqslant j\leqslant {L^{\textit{FP}}},\]

we finally define the decision thresholds ${x_{j,i}}$ ($0\leqslant i\leqslant K={2^{R-e-1}}$) and the representational levels ${y_{j,i}}$ ($1\leqslant i\leqslant K={2^{R-e-1}}$) of PUQ^μ within the j-th segment:

(28)

\[\begin{aligned}{}& {x_{j,i}}={x_{j-1}^{\textit{seg}}}+i{\Delta _{j}}\approx {2^{-{L^{\textit{FP}}}/2+1}}\big({2^{j-1}}\big(1+i{2^{-R+e+1}}\big)-1\big),\end{aligned}\]

(29)

\[\begin{aligned}{}& {y_{j,i}}={x_{j-1}^{\textit{seg}}}+(i-1/2){\Delta _{j}}\approx {2^{-{L^{\textit{FP}}}/2+1}}\big({2^{j-1}}\big(1+(2i-1){2^{-R+e}}\big)-1\big).\end{aligned}\]
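The approximate forms of (28) and (29) can be set next to the FP numbers from (2). The sketch below uses a toy format ($R=8$, $e=4$, so $m=3$ and ${L^{\textit{FP}}}=14$), an assumption made only to keep the enumeration small, and hypothetical function names. With the pairing ${E^{\ast }}=j-{L^{\textit{FP}}}/2$ and $M=i$, the threshold ${x_{j,i}}$ and the corresponding FP number have the same spacing and differ by the constant ${2^{-{L^{\textit{FP}}}/2+1}}={2^{{E_{\min }^{\ast }}}}$.

```python
def puq_threshold(j, i, R, e):
    """Decision threshold x_{j,i}, approximate form of Eq. (28)."""
    L_fp = 2**e - 2
    scale = 2.0 ** (1 - L_fp // 2)
    return scale * (2.0 ** (j - 1) * (1 + i * 2.0 ** (-R + e + 1)) - 1)


def puq_level(j, i, R, e):
    """Representational level y_{j,i}, approximate form of Eq. (29)."""
    L_fp = 2**e - 2
    scale = 2.0 ** (1 - L_fp // 2)
    return scale * (2.0 ** (j - 1) * (1 + (2 * i - 1) * 2.0 ** (-R + e)) - 1)


def fp_number(j, i, R, e):
    """Positive FP number from Eq. (2) with E* = j - L_FP/2 and M = i."""
    L_fp = 2**e - 2
    m = R - e - 1
    return 2.0 ** (j - L_fp // 2) * (1 + i / 2**m)


R, e = 8, 4  # toy format, for illustration only
for j in (1, 2, 14):
    for i in (0, 1, 7):
        x = puq_threshold(j, i, R, e)
        print(j, i, x, fp_number(j, i, R, e), fp_number(j, i, R, e) - x)
```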

3.2 Performance Evaluation

Here, we provide the performance (SQNR) expressions for the discussed μCQ and PUQ^μ. For μCQ, the granular distortion ${D_{g}^{\mu }}$ (the distortion component introduced in the granular part $[-{x_{\max }},{x_{\max }}]$) can be assessed using Bennett’s integral (Jayant and Noll, 1984; Chu, 2003; Gersho and Gray, 1992):

(30)

\[ {D_{g}^{\mu }}(\sigma )=2\frac{{\Delta _{u}^{2}}}{12}{\int _{0}^{{x_{\max }}}}\frac{p(x,\sigma )}{{[{c^{\prime }_{\mu }}(x)]^{2}}}dx,\]

where ${c^{\prime }_{\mu }}(x)$ is the first derivative of ${c_{\mu }}(x)$, while the overload distortion ${D_{ov}^{\mu }}$ (the distortion component introduced outside the granular part) is given by (Jayant and Noll, 1984; Chu, 2003; Gersho and Gray, 1992):

(31)

\[ {D_{ov}^{\mu }}(\sigma )=2{\int _{{x_{\max }}}^{+\infty }}{(x-{x_{\max }})^{2}}p(x,\sigma )dx.\]

The granular distortion of PUQ^μ, ${D_{g}^{{\mathrm{PUQ}^{\mu }}}}$, can be evaluated according to the following expression (Jayant and Noll, 1984; Chu, 2003; Gersho and Gray, 1992):

(32)

\[ {D_{g}^{{\mathrm{PUQ}^{\mu }}}}(\sigma )=2{\sum \limits_{j=1}^{{L^{\textit{FP}}}}}\frac{{\Delta _{j}^{2}}}{12}{P_{j}}(\sigma ),\]

where ${P_{j}}(\sigma )={\textstyle\int _{{x_{j-1}^{\textit{seg}}}}^{{x_{j}^{\textit{seg}}}}}p(x,\sigma )dx$ denotes the probability of the j-th segment ($1\leqslant j\leqslant {L^{\textit{FP}}}$), while the overload distortion of PUQ^μ, ${D_{ov}^{{\mathrm{PUQ}^{\mu }}}}$, can be estimated by (31). Theorem 2 relates the distortions of these two quantizers.

Theorem 2.

If $L\gg 1$, distortions of μCQ and its linearized version PUQ^μ converge.

Proof.

As the overload distortion for these two models is defined with the same expression (31), it is sufficient to show that Bennett’s integral (30) closely approximates the granular distortion of PUQ^μ for $L\gg 1$. Let ${d_{j}}={x_{j}^{\textit{seg}}}-{x_{j-1}^{\textit{seg}}}$ denote the width of the segment $[{x_{j-1}^{\textit{seg}}}$, ${x_{j}^{\textit{seg}}})$ and let ${y_{j}^{\textit{seg}}}=({x_{j}^{\textit{seg}}}+{x_{j-1}^{\textit{seg}}})/2$ denote the middle of the segment, where $1\leqslant j\leqslant L$. From the condition $L\gg 1$, it follows that the segment’s width ${d_{j}}$ is very small, so the PDF of the input data can be considered as almost constant within the segment [${x_{j-1}^{\textit{seg}}}$, ${x_{j}^{\textit{seg}}}$), i.e. $p(x,\sigma )\approx p({y_{j}^{\textit{seg}}},\sigma )$ for $x\in $ [${x_{j-1}^{\textit{seg}}}$, ${x_{j}^{\textit{seg}}}$); hence the segment’s probability can be approximated as ${P_{j}}(\sigma )={\textstyle\int _{{x_{j-1}^{\textit{seg}}}}^{{x_{j}^{\textit{seg}}}}}p(x,\sigma )dx\approx p({y_{j}^{\textit{seg}}},\sigma ){\textstyle\int _{{x_{j-1}^{\textit{seg}}}}^{{x_{j}^{\textit{seg}}}}}dx=p({y_{j}^{\textit{seg}}},\sigma ){d_{j}}$. In addition, the slope of the compression function ${c_{\mu }}(x)$ can also be considered as nearly constant within the segment [${x_{j-1}^{\textit{seg}}}$, ${x_{j}^{\textit{seg}}}$), i.e. ${c^{\prime }_{\mu }}(x)\approx {c^{\prime }_{\mu }}({y_{j}^{\textit{seg}}})=\frac{{\Delta _{u}}}{{\Delta _{j}}}$ (Jayant and Noll, 1984), from which it follows that ${\Delta _{j}}=\frac{{\Delta _{u}}}{{c^{\prime }_{\mu }}({y_{j}^{\textit{seg}}})}$. Now expression (32) can be written as:

(33)

\[\begin{aligned}{}{D_{g}^{{\mathrm{PUQ}^{\mu }}}}(\sigma )& =2{\sum \limits_{j=1}^{L}}\frac{{\Delta _{j}^{2}}}{12}{P_{j}}(\sigma )=2\frac{{\Delta _{u}^{2}}}{12}{\sum \limits_{j=1}^{L}}\frac{p({y_{j}^{\textit{seg}}},\sigma )}{{[{c^{\prime }_{\mu }}({y_{j}^{\textit{seg}}})]^{2}}}{d_{j}}\\ {} & \approx 2\frac{{\Delta _{u}^{2}}}{12}{\int _{0}^{{x_{\max }}}}\frac{p(x,\sigma )}{{[{c^{\prime }_{\mu }}(x)]^{2}}}dx,\end{aligned}\]

thus concluding the proof. □

Since $L={L^{\textit{FP}}}$ and ${L^{\textit{FP}}}\gg 1$, the condition of Theorem 2 is fulfilled, ensuring the closeness of the distortions of the quantizers μCQ and PUQ^μ.

Applying (8) and (10) in (30) and combining it with (31), we arrive at the closed-form expression for the total distortion of μCQ provided in Perić et al. (2010):

(34)

\[ {D^{\mu }}(\sigma )=\underset{{D_{g}^{\mu }}(\sigma )}{\underbrace{\frac{{\ln ^{2}}(1+\mu )}{3{N^{2}}}\bigg({\bigg(\frac{{x_{\max }}}{\mu }\bigg)^{2}}+\sigma \sqrt{2}\frac{{x_{\max }}}{\mu }+{\sigma ^{2}}\bigg)}}+\underset{{D_{ov}^{\mu }}(\sigma )}{\underbrace{{\sigma ^{2}}\exp \bigg(-\sqrt{2}\frac{{x_{\max }}}{\sigma }\bigg)}}.\]

Since ${x_{\max }}={x_{\max }^{\textit{FP}}}$, then according to (3) and (23), we have that ${x_{\max }^{\textit{FP}}}/\mu ={2^{{L^{\textit{FP}}}/2+1}}/({2^{{L^{\textit{FP}}}}}-1)\approx {2^{{L^{\textit{FP}}}/2+1}}/{2^{{L^{\textit{FP}}}}}={2^{-{L^{\textit{FP}}}/2+1}}\ll 1$; hence the last expression becomes:

(35)

\[ {D^{\mu }}(\sigma )={\sigma ^{2}}\bigg(\frac{{\ln ^{2}}(1+\mu )}{3{N^{2}}}+\exp \bigg(-\sqrt{2}\frac{{x_{\max }^{\textit{FP}}}}{\sigma }\bigg)\bigg).\]

Based on (3), (18), and (23), expression (35) can be written as:

(36)

\[ {D^{\mu }}(\sigma )={\sigma ^{2}}\bigg(\frac{1}{3\cdot {2^{2R}}}{\bigg(\frac{{L^{\textit{FP}}}\ln 2}{1-{2^{1-e}}}\bigg)^{2}}+\exp \bigg(-\frac{{2^{({L^{\textit{FP}}}+3)/2}}}{\sigma }\bigg)\bigg).\]

Using (6) and (36), we derive the following final SQNR expression for μCQ:

(37)

\[ {\mathrm{SQNR}^{\mu }}(\sigma )=-10{\log _{10}}\bigg(\frac{1}{3\cdot {2^{2R}}}{\bigg(\frac{{L^{\textit{FP}}}\ln 2}{1-{2^{1-e}}}\bigg)^{2}}+\exp \bigg(-\frac{{2^{({L^{\textit{FP}}}+3)/2}}}{\sigma }\bigg)\bigg).\]
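Expression (37) translates directly into a few lines of code. The sketch below is an illustrative transcription with a hypothetical function name; it takes σ on a linear scale.

```python
import math


def sqnr_mu_db(sigma, R, e=8):
    """Closed-form SQNR of the muCQ tailored to the R-bit FP format, Eq. (37), in dB."""
    L_fp = 2**e - 2
    granular = (L_fp * math.log(2) / (1 - 2.0 ** (1 - e))) ** 2 / (3.0 * 2.0 ** (2 * R))
    overload = math.exp(-(2.0 ** ((L_fp + 3) / 2)) / sigma)
    return -10.0 * math.log10(granular + overload)


for R in (24, 32):
    print(f"FP{R}: {sqnr_mu_db(1.0, R):.2f} dB at sigma = 1")
```

No loop over segments is involved: the cost of one evaluation does not depend on ${L^{\textit{FP}}}$.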

Based on (6), (27), and (32), knowing that for the Laplacian PDF ${\textstyle\int _{{x_{j-1}^{\textit{seg}}}}^{{x_{j}^{\textit{seg}}}}}p(x,\sigma )dx=\frac{1}{2}(\exp (-\sqrt{2}{x_{j-1}^{\textit{seg}}}/\sigma )-\exp (-\sqrt{2}{x_{j}^{\textit{seg}}}/\sigma ))$, the SQNR expression for PUQ^μ is delivered:

(38)

\[\begin{aligned}{}& {\mathrm{SQNR}^{{\mathrm{PUQ}^{\mu }}}}(\sigma )\\ {} & \hspace{1em}=-10{\log _{10}}\bigg[{\sum \limits_{j=1}^{{L^{\textit{FP}}}}}\frac{{2^{-{L^{\textit{FP}}}+2j-2(R-e)}}}{3{\sigma ^{2}}}\bigg(\exp \bigg(-\frac{{2^{-({L^{\textit{FP}}}-3)/2}}({2^{j-1}}-1)}{\sigma }\bigg)\\ {} & \hspace{2em}-\exp \bigg(-\frac{{2^{-({L^{\textit{FP}}}-3)/2}}({2^{j}}-1)}{\sigma }\bigg)\bigg)+\exp \bigg(-\frac{{2^{({L^{\textit{FP}}}+3)/2}}}{\sigma }\bigg)\bigg].\end{aligned}\]

Since it is given in closed form, expression (37) has substantially lower computational complexity than expressions (9) and (38), each of which sums ${L^{\textit{FP}}}$ terms. The next section will demonstrate that the numerical results yielded by (9), (37), and (38) closely align, implying that the closed-form expression (37) serves as a very precise approximation for the SQNR of the FPQ, and therefore for the performance of the FP format.
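The closeness of (37) and (38) can be spot-checked at a single variance. The sketch below transcribes both expressions under hypothetical function names; the choice $\sigma =1$ is arbitrary, and the rewriting of the exponential difference with `math.expm1` is a numerical convenience introduced here.

```python
import math


def sqnr_puq_db(sigma, R, e=8):
    """SQNR of PUQ^mu for a Laplacian source, Eq. (38), in dB."""
    L_fp = 2**e - 2
    c = 2.0 ** (-(L_fp - 3) / 2)  # sqrt(2) * 2^(-L_FP/2 + 1)
    total = 0.0
    for j in range(1, L_fp + 1):
        lo = c * (2.0 ** (j - 1) - 1) / sigma  # sqrt(2) * x_{j-1}^seg / sigma
        width = c * 2.0 ** (j - 1) / sigma     # sqrt(2) * (x_j^seg - x_{j-1}^seg) / sigma
        # exp(-lo) - exp(-(lo + width)), written to avoid cancellation
        prob_term = -math.exp(-lo) * math.expm1(-width)
        total += 2.0 ** (-L_fp + 2 * j - 2 * (R - e)) / (3.0 * sigma**2) * prob_term
    total += math.exp(-(2.0 ** ((L_fp + 3) / 2)) / sigma)
    return -10.0 * math.log10(total)


def sqnr_mu_db(sigma, R, e=8):
    """Closed-form SQNR, Eq. (37), in dB."""
    L_fp = 2**e - 2
    granular = (L_fp * math.log(2) / (1 - 2.0 ** (1 - e))) ** 2 / (3.0 * 2.0 ** (2 * R))
    return -10.0 * math.log10(granular + math.exp(-(2.0 ** ((L_fp + 3) / 2)) / sigma))


for R in (24, 32):
    s38, s37 = sqnr_puq_db(1.0, R), sqnr_mu_db(1.0, R)
    print(f"FP{R}: (38) = {s38:.2f} dB, (37) = {s37:.2f} dB, "
          f"relative difference = {abs(s38 - s37) / s38:.2%}")
```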

4 Numerical Results and Discussion

In this section, we present and discuss numerical results for the derived SQNR formulas (37) and (38) obtained in evaluating the performance of the R-bit FP format with $e=8$ (8-bit exponent) in a very wide variance range, where $R=24$ and 32 (i.e. FP24 and FP32 formats). Note that two bit rates R are considered to show the generality of the given formulas, whose accuracy is measured with respect to formula (9). To facilitate the observation of variance across a wide range, it is usual to define variance in the logarithmic domain as ${\sigma ^{2}}\hspace{2.5pt}[\mathrm{dB}]=10{\log _{10}}({\sigma ^{2}}/{\sigma _{\textit{ref}}^{2}})$, where ${\sigma _{\textit{ref}}^{2}}$ represents the reference variance. Without loss of generality, we can assume that ${\sigma _{\textit{ref}}^{2}}=1$, obtaining ${\sigma ^{2}}\hspace{2.5pt}[\mathrm{dB}]=10{\log _{10}}{\sigma ^{2}}$. Substituting $\sigma ={10^{{\sigma ^{2}}\hspace{2.5pt}[\mathrm{dB}]/20}}$ into the previously derived expressions for SQNR yields the dependence of SQNR on ${\sigma ^{2}}\hspace{2.5pt}[\mathrm{dB}]$.
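The change of variable between ${\sigma ^{2}}\hspace{2.5pt}[\mathrm{dB}]$ and σ is a one-liner in each direction. The helpers below are a sketch with hypothetical names, assuming ${\sigma _{\textit{ref}}^{2}}=1$ as in the text.

```python
import math


def variance_db_to_sigma(var_db):
    """sigma = 10^(sigma^2[dB] / 20), assuming a reference variance of 1."""
    return 10.0 ** (var_db / 20.0)


def sigma_to_variance_db(sigma):
    """sigma^2[dB] = 10 log10(sigma^2) = 20 log10(sigma)."""
    return 20.0 * math.log10(sigma)


for var_db in (-500.0, 0.0, 800.0):
    print(var_db, variance_db_to_sigma(var_db))
```

Any SQNR expression written in terms of σ can then be evaluated on a dB grid by composing it with `variance_db_to_sigma`.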

Fig. 1

Performance (SQNR) of FP24 and FP32 formats in a very wide variance range, estimated using different formulas.

Figure 1 shows the performance (SQNR) of the FP24 and FP32 formats over a very wide variance range ${\sigma ^{2}}\hspace{2.5pt}[\mathrm{dB}]\in $ [−500 dB, 800 dB], calculated using (9), (37), and (38). It is worth mentioning that the chosen variance range is significantly broader than that commonly used for scalar quantizer analysis (typically ${\sigma ^{2}}\hspace{2.5pt}[\mathrm{dB}]\in $ [−20 dB, 20 dB] or ${\sigma ^{2}}\hspace{2.5pt}[\mathrm{dB}]\in $ [−30 dB, 30 dB], as seen in Perić et al. (2010), Denić et al. (2023)). The figure shows that the results of SQNR formulas (9) and (38) are in excellent agreement for each considered ${\sigma ^{2}}\hspace{2.5pt}[\mathrm{dB}]$. Based on this performance matching, we argue that the discussed PUQ^μ and FPQ are compatible, which confirms the correctness of the applied design process. It can also be observed that the SQNR values achieved by (37) are very close to those achieved by (38) (and accordingly by (9)), which is in agreement with Theorem 2. From Fig. 1, it is evident that there is a threshold variance, denoted by ${\sigma _{t}^{2}}\hspace{2.5pt}[\mathrm{dB}]$, such that for ${\sigma ^{2}}\hspace{2.5pt}[\mathrm{dB}]\leqslant {\sigma _{t}^{2}}\hspace{2.5pt}[\mathrm{dB}]$ the granular distortion ${D_{g}^{\mu }}$ dominates, so ${\mathrm{SQNR}^{\mu }}$ becomes:

(39)

\[ {\mathrm{SQNR}^{\mu }}\approx 10{\log _{10}}\bigg(\frac{{\sigma ^{2}}}{{D_{g}^{\mu }}}\bigg)=-10{\log _{10}}\bigg(\frac{1}{3\cdot {2^{2R}}}{\bigg(\frac{{L^{\textit{FP}}}\ln 2}{1-{2^{1-e}}}\bigg)^{2}}\bigg)=\operatorname{const},\]

i.e. it remains constant and does not depend on the data variance ${\sigma ^{2}}$. This can be interpreted as follows: since ${\mathrm{SQNR}^{\mu }}$ is independent of the PDF parameter ${\sigma ^{2}}$, Laplacian data of any variance in this range yields the same SQNR score. On the other hand, for ${\sigma ^{2}}\hspace{2.5pt}[\mathrm{dB}]\gt {\sigma _{t}^{2}}\hspace{2.5pt}[\mathrm{dB}]$ the overload distortion ${D_{ov}^{\mu }}$ prevails, leading to a sharp drop in SQNR. The threshold variance ${\sigma _{t}^{2}}\hspace{2.5pt}[\mathrm{dB}]$ is 745 dB for the FP24 format and 742 dB for the FP32 format.
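The onset of the overload regime can be checked directly from the last term of (38), $\exp (-{2^{({L^{\textit{FP}}}+3)/2}}/\sigma )$, which is the overload contribution normalized by ${\sigma ^{2}}$. The short Python sketch below evaluates this term alone at a few variances; the function name `overload_term` and the chosen sample points are illustrative assumptions, and ${L^{\textit{FP}}}=254$ is the value quoted in this section for $e=8$.

```python
import math


def overload_term(sigma2_db, L_fp=254):
    """Overload term of (38), exp(-2^((L^FP + 3)/2) / sigma), for a variance in dB."""
    sigma = 10.0 ** (sigma2_db / 20.0)
    return math.exp(-(2.0 ** ((L_fp + 3) / 2)) / sigma)


if __name__ == "__main__":
    # Sample points below, near, and above the reported threshold variances.
    for sigma2_db in (700.0, 742.0, 745.0, 800.0):
        print(sigma2_db, overload_term(sigma2_db))
```

The term underflows to zero in double precision at 700 dB, is still many orders of magnitude below 1 near the reported thresholds, and approaches 1 at 800 dB, which is consistent with the sharp drop in SQNR described above. Because this term does not depend on R while the granular part does, the point at which it becomes comparable to the granular distortion differs slightly between FP24 and FP32.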

Let us introduce the relative error δ_SQNR [%] as an accuracy measure of the SQNR formula (37) with respect to (9). The values of δ_SQNR [%] are illustrated in Fig. 2. Figure 2 indicates that the SQNR calculation error for ${\sigma ^{2}}\hspace{2.5pt}[\mathrm{dB}]\leqslant {\sigma _{t}^{2}}\hspace{2.5pt}[\mathrm{dB}]$ is below 0.5% in the case of FP24 format performance evaluation and below 0.35% in the case of FP32 format performance evaluation; for ${\sigma ^{2}}\hspace{2.5pt}[\mathrm{dB}]\gt {\sigma _{t}^{2}}\hspace{2.5pt}[\mathrm{dB}]$, the SQNR error tends to zero, as predicted. Given that δ_SQNR [%] < 1%, we report that reasonable accuracy of the SQNR formula defined in Na (2011) is achieved with the proposed approximate formula (37). Due to this achievement and the fact that (37) is considerably less computationally intensive than (9), whose sum includes ${L^{\textit{FP}}}=254$ terms (since $e=8$), we confirm that (37) can indeed be used as an adequate tool for evaluating the performance of the FP format for Laplacian data.

Fig. 2

Accuracy of the SQNR formula (37) in estimating the performance of FP24 and FP32 formats.

The given SQNR analysis can also be useful in selecting the optimal FP format for the target application. Specifically, from the standpoint of digital representation quality, FP32 is a better solution than FP24 due to its higher SQNR score; however, from the standpoint of dynamic range, both the FP32 and FP24 formats are very efficient, as they retain a constant SQNR across a very wide variance range. Due to these positive features and the fact that its implementation complexity is lower than that of FP32, FP24 can be seen as an attractive choice for various practical applications.

5 Conclusion

This paper builds upon the analogy between FP digital representation and quantization established in the literature, introducing a novel idea regarding the link between the FP format and the μCQ. It presents a method for designing and linearizing the μCQ to achieve a piecewise uniform quantizer PUQ^μ tailored to the FP format. Given the FP format’s similarity in structure to PUQ^μ and the close performance of PUQ^μ to μCQ, a closed-form expression for the SQNR of μCQ has been proposed in this paper to evaluate the FP format’s performance, which holds general applicability across various bit rates and data variances. Numerical assessments spanning a very wide variance range, conducted for some commonly used FP formats with an 8-bit exponent, showed that the proposed SQNR expression is fully applicable to FP format performance evaluation: the SQNR calculation error stays below the predefined threshold of 1%, and the computational intensity is significantly lower than that of the existing method, which relies on the summation of numerous terms (254 when $e=8$). As the computational complexity of the existing method increases even more for $e\gt 8$, a significant simplification of the FP format evaluation process is expected by applying the proposed method. By providing an efficient and accurate mechanism for the evaluation of FP format performance, this paper facilitates the selection of the optimal FP bit configuration for a specific application, crucial for digital representation quality, dynamic range, computational overhead, and energy efficiency.

A Appendix

Table A1 provides an overview of abbreviations and specific symbols used in this paper.

Table A1

Employed abbreviations and symbols.

Abbreviations		Symbols
Bfloat16	16-bit floating point format	E	exponent of a floating point number
DLFloat	16-bit floating point format	M	significand of a floating point number
FP8	8-bit floating point format	e	number of bits for exponent
FP24	24-bit floating point format	m	number of bits for significand
FP32	32-bit floating point format	R	bit rate
FP64	64-bit floating point format	${E^{\ast }}$	biased exponent
FPQ	floating point quantizer	${E_{\min }^{\ast }}$	minimal value of biased exponent
PDF	probability density function	${E_{\max }^{\ast }}$	maximum value of biased exponent
PUQ^μ	piecewise uniform quantizer based on the μ-law compression function	${S_{{E^{\ast }}}}$	segment in the positive part of floating point numbers
SQNR	signal to quantization noise ratio	${L^{\textit{FP}}}$	number of segments ${S_{{E^{\ast }}}}$
μCQ	μ-law companding quantizer	${\Delta _{{E^{\ast }}}}$	step size in segment ${S_{{E^{\ast }}}}$
		${\delta _{{E^{\ast }}}}$	width of segment ${S_{{E^{\ast }}}}$
		${P_{{E^{\ast }}}}(\sigma )$	probability of segment ${S_{{E^{\ast }}}}$
		${x_{\max }^{\textit{FP}}}$	maximal floating point number
		${\sigma ^{2}}$	variance of input Laplacian data
		${D_{g}^{\mathrm{FPQ}}}$	granular distortion of FPQ
		${D_{ov}^{\mathrm{FPQ}}}$	overload distortion of FPQ
		D^FPQ	total distortion of FPQ
		SQNR^FPQ	signal to quantization noise ratio of FPQ
		${c_{\mu }}(x)$	μ-law compression function
		${c_{\mu }^{-1}}(x)$	inverse μ-law compression function
		μ	compression factor
		${x_{\max }}$	maximal amplitude of μCQ
		${x_{j}^{\mu }}$	decision thresholds of μCQ
		${y_{j}^{\mu }}$	representational levels of μCQ
		N	number of representational levels
		${\Delta _{u}}$	step size of the uniform quantizer
		${g_{\mu }}(x)$	piecewise linear compression function
		${a_{j}}$	coefficient of ${g_{\mu }}(x)$
		${b_{j}}$	coefficient of ${g_{\mu }}(x)$
		L	number of segments of PUQ^μ
		${x_{j}^{\textit{seg}}}$	segment thresholds of PUQ^μ
		K	number of uniform levels within PUQ^μ segments
		${\Delta _{j}}$	step size within segment of PUQ^μ
		${x_{j,i}}$	i-th decision threshold within the j-th segment of PUQ^μ
		${y_{j,i}}$	i-th representational level within the j-th segment of PUQ^μ
		${D_{g}^{\mu }}$	granular distortion of μCQ
		${D_{ov}^{\mu }}$	overload distortion of μCQ
		${D^{\mu }}$	total distortion of μCQ
		SQNR^μ	signal to quantization noise ratio of μCQ
		${P_{j}}$	segment probability of PUQ^μ
		${D_{g}^{{\mathrm{PUQ}^{\mu }}}}$	granular distortion of PUQ^μ
		${D_{ov}^{{\mathrm{PUQ}^{\mu }}}}$	overload distortion of PUQ^μ
		${\mathrm{SQNR}^{{\mathrm{PUQ}^{\mu }}}}$	signal to quantization noise ratio of PUQ^μ

References

Agrawal, A., Mueller, S.M., Fleischer, B.M., Sun, X., Wang, N., Choi, J., Gopalakrishnan, K. (2019). DLFloat: a 16-b floating point format designed for deep learning training and inference. In: Proceedings of the IEEE 26th Symposium on Computer Arithmetic (ARITH), Kyoto, Japan, pp. 92–95. https://doi.org/10.1109/ARITH.2019.00023.

Bai-Kui, Y., Shanq-Jang, R. (2023). Area efficient compression for floating-point feature maps in convolutional neural network accelerators. IEEE Transactions on Circuits and Systems II, 70(2), 746–750. https://doi.org/10.1109/TCSII.2022.3213847.

Banner, R., Nahshan, Y., Hoffer, E., Soudry, D. (2018). ACIQ: Analytical Clipping for Integer Quantization of Neural Networks. arXiv preprint, arXiv:1810.05723.

Banner, R., Nahshan, Y., Soudry, D. (2019). Post training 4-bit quantization of convolutional networks for rapid-deployment. In: Proceedings of the 33rd Conference on Neural Information Processing Systems (NeurIPS), No. 714, Vancouver, BC, Canada, pp. 7950–7958.

Botta, M., Cavagnino, D., Esposito, R. (2021). NeuNAC: a novel fragile watermarking algorithm for integrity protection of neural networks. Information Sciences, 576, 228–241. https://doi.org/10.1016/j.ins.2021.06.073.

Burgess, N., Goodyer, C., Hinds, C.N., Lutz, D.R. (2019a). High-precision anchored accumulators for reproducible floating-point summation. IEEE Transactions on Computers, 68(7), 967–978. https://doi.org/10.1109/TC.2018.2855729.

Burgess, N., Milanovic, J., Stephens, N., Monachopoulos, K., Mansell, D. (2019b). Bfloat16 processing for neural networks. In: Proceedings of the IEEE 26th Symposium on Computer Arithmetic, ARITH 2019, Kyoto, Japan, June, pp. 10–12. https://doi.org/10.1109/ARITH.2019.00022.

Cattaneo, D., Di Bello, A., Cherubin, S., Terraneo, F., Agosta, G. (2018). Embedded operating system optimization through floating to fixed point compiler transformation. In: Proceedings of the 21st Euromicro Conference on Digital System Design (DSD), Prague, Czech Republic, pp. 172–176.

Chu, W.C. (2003). Speech Coding Algorithms: Foundation and Evolution of Standardized Coders. John Wiley & Sons, New Jersey.

Denić, B., Perić, Z., Dinčić, M. (2023). Improvement of the Bfloat16 floating-point for the Laplacian source. In: Proceedings of the IEEE 13th International Symposium on Advanced Topics in Electrical Engineering (ATEE), Bucharest, Romania, pp. 1–4. https://doi.org/10.1109/ATEE58038.2023.10108130.

Dinčić, M., Perić, Z., Tančić, M., Denić, D., Stamenković, Z., Denić, B. (2021). Support region of μ-law logarithmic quantizers for Laplacian source applied in neural networks. Microelectronics Reliability, 124, 114269.

Dinčić, M., Perić, Z., Jovanović, A. (2016). New coding algorithm based on variable-length codewords for piecewise uniform quantizers. Informatica, 27(3), 527–548. https://doi.org/10.15388/Informatica.2016.98.

Fasi, M., Mikaitis, M. (2021). Algorithms for stochastically rounded elementary arithmetic operations in IEEE 754 floating-point arithmetic. IEEE Transactions on Emerging Topics in Computing, 9(3), 1451–1466. https://doi.org/10.1109/TETC.2021.3069165.

Gazor, S., Zhang, W. (2003). Speech probability distribution. IEEE Signal Processing Letters, 10(7), 204–207. https://doi.org/10.1109/LSP.2003.813679.

Gersho, A., Gray, R. (1992). Vector Quantization and Signal Compression. Kluwer Academic Publishers, New York.

IEEE 754 (2019). IEEE Standard for Floating Point Arithmetic.

Jayant, N.C., Noll, P. (1984). Digital Coding of Waveforms: Principles and Applications to Speech and Video. Prentice Hall, New Jersey.

Junaid, M., Arslan, S., Lee, T., Kim, H. (2022). Optimal architecture of floating-point arithmetic for neural network training processor. Sensors, 22, 1230. https://doi.org/10.3390/s22031230.

Moroz, L., Samotyy, V. (2019). Efficient floating-point division for digital signal processing application. IEEE Signal Processing Magazine, 36(1), 159–163. https://doi.org/10.1109/MSP.2018.2875977.

Na, S. (2011). Asymptotic formulas for variance-mismatched fixed-rate scalar quantization of a Gaussian source. IEEE Transactions on Signal Processing, 59(5), 2437–2441. https://doi.org/10.1109/TSP.2011.2112354.

Perić, Z., Dinčić, M., Denić, D., Jocić, A. (2010). Forward adaptive logarithmic quantizer with new lossless coding method for Laplacian source. Wireless Personal Communications, 59(4), 625–641. https://doi.org/10.1007/s11277-010-9929-3.

Perić, Z., Savić, M., Dinčić, M., Vučić, N., Djošić, D., Milosavljević, S. (2021). Floating point and fixed point 32-bits quantizers for quantization of weights of neural networks. In: Proceedings of the IEEE 12th International Symposium on Advanced Topics in Electrical Engineering (ATEE), Bucharest, Romania, pp. 1–4. https://doi.org/10.1109/ATEE52255.2021.9425265.

Syed, R.T., Ulbricht, M., Piotrowski, K., Krstic, M. (2021). Fault resilience analysis of quantized deep neural networks. In: Proceedings of the IEEE 32nd International Conference on Microelectronics (MIEL), Niš, Serbia, pp. 275–294. https://doi.org/10.1109/MIEL52794.2021.9569094.

Wang, N., Choi, J., Brand, D., Chen, C.Y., Gopalakrishnan, K. (2018). Training deep neural networks with 8-bit floating point numbers. In: Proceedings of the 32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada, 2018, pp. 7686–7695.

Yang, Y., Chi, X., Deng, L., Yan, T., Gao, F., Li, G. (2022). Towards efficient full 8-bit integer DNN online training on resource-limited devices without batch normalization. Neurocomputing, 511, 175–186. https://doi.org/10.1016/j.neucom.2022.08.045.

Zhao, W., Dang, Q., Xia, T., Zhang, J., Zheng, N., Ren, P. (2023). Optimizing FPGA-based DNN accelerator with shared exponential floating-point format. IEEE Transactions on Circuits and Systems I, 70(11), 4478–4491. https://doi.org/10.1109/TCSI.2023.3300657.

Biographies

Perić Zoran

zoran.peric@elfak.ni.ac.rs

Z. Perić was born in Niš, Serbia, in 1964. He received the BS, MS and PhD degrees from the Faculty of Electronic Engineering, University of Niš, Serbia, in 1989, 1994 and 1999, respectively. He is a full-time professor at the Department of Telecommunications, Faculty of Electronic Engineering, University of Niš. His current research interests include information theory and signal processing. He is an author and co-author of over 350 papers. Dr. Perić has been a reviewer for a number of journals, including IEEE Transactions on Information Theory, IEEE Transactions on Signal Processing, IEEE Transactions on Communications, Compel, Informatica, Information Technology and Control, Expert Systems with Applications and Digital Signal Processing.

Denić Bojan

bojan.denic@elfak.ni.ac.rs

B. Denić received his PhD degree in the field of Telecommunications in 2023 from the Faculty of Electronic Engineering, University of Niš, Serbia. Currently, he is working as a research associate at the same faculty. His main research interests include signal processing, quantization and machine learning. He is an author of 34 scientific papers (18 of them in peer-reviewed international journals).

Dinčić Milan

milan.dincic@elfak.ni.ac.rs

M. Dinčić received MSc in 2007, PhD in the field of Telecommunication in 2012 and PhD in the field of Measurements in 2017 from the University of Niš. Currently, he is working as an associate professor at the Faculty of Electronic Engineering. He is an author of 64 scientific papers (35 of them in reputable international journals with IF from the SCI/SCIe list). His research is related to quantization and compression of neural networks, sensors and measurement systems.

Perić Sofija

sofija.peric@elfak.ni.ac.rs

Open access article under the CC BY license.

Keywords

floating-point format; piecewise uniform quantization; μ-law companding quantization; Laplacian source

Funding

This work was supported by the Ministry of Science, Technological Development and Innovation of the Republic of Serbia [grant number 451-03-65/2024-03/200102], as well as by the European Union’s Horizon 2023 research and innovation programme through the AIDA4Edge Twinning project (grant ID 101160293).
