Informatica
An Approximate Closed-Form Expression for Calculating Performance of Floating-Point Format for the Laplacian Source
Volume 36, Issue 1 (2025), pp. 125–140
Zoran Perić   Bojan Denić   Milan Dinčić   Sofija Perić  

https://doi.org/10.15388/25-INFOR587
Pub. online: 18 March 2025      Type: Research Article      Open Access

Received
1 October 2024
Accepted
1 March 2025
Published
18 March 2025

Abstract

This paper introduces a novel approach that bridges the floating-point (FP) format, widely used across diverse fields for data representation, with the μ-law companding quantizer. It proposes a method for designing and linearizing the μ-law companding quantizer to yield a piecewise uniform quantizer tailored to the FP format. A key outcome of the paper is a closed-form approximate expression for accurately and efficiently evaluating the FP format’s performance for data with the Laplacian distribution. The expression is general across bit rates and data variances, and it markedly reduces the computational complexity of FP performance evaluation compared to prior methods that rely on summing a large number of terms. By facilitating the evaluation of FP format performance, this research substantially aids the selection of the optimal bit rate, which is crucial for digital representation quality, dynamic range, computational overhead, and energy efficiency. Numerical calculations spanning a wide range of data variances, carried out for several commonly used FP versions with an 8-bit exponent, demonstrate that the proposed closed-form expression closely approximates the FP format’s performance.
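The performance measure discussed in the abstract (how faithfully an FP format with an 8-bit exponent represents Laplacian-distributed data) can also be estimated by brute-force simulation, which is the kind of expensive evaluation the paper's closed-form expression is meant to avoid. The sketch below is an illustration of that baseline, not the paper's method: it rounds unit-variance Laplacian samples to a bfloat16-style format (8-bit exponent, 7 stored mantissa bits, one of the commonly used FP versions the abstract mentions) and measures the resulting signal-to-quantization-noise ratio. All function names here are illustrative, not from the article.

```python
import math
import random
import struct

def bfloat16_round(x: float) -> float:
    # Reinterpret x as IEEE 754 binary32 bits, then keep sign, the 8-bit
    # exponent, and the top 7 mantissa bits; adding 0x8000 before masking
    # implements round-to-nearest on the discarded low 16 bits.
    bits = struct.unpack('>I', struct.pack('>f', x))[0]
    bits = (bits + 0x8000) & 0xFFFF0000
    return struct.unpack('>f', struct.pack('>I', bits))[0]

def laplacian_samples(n: int, sigma: float = 1.0, seed: int = 0) -> list:
    # A Laplacian variable with variance sigma^2 has scale b = sigma/sqrt(2);
    # the difference of two i.i.d. exponentials with that scale is Laplacian.
    rng = random.Random(seed)
    b = sigma / math.sqrt(2.0)
    return [rng.expovariate(1.0 / b) - rng.expovariate(1.0 / b)
            for _ in range(n)]

def sqnr_db(samples, quantize) -> float:
    # SQNR = 10 log10( signal power / quantization-noise power ).
    signal = sum(x * x for x in samples)
    noise = sum((x - quantize(x)) ** 2 for x in samples)
    return 10.0 * math.log10(signal / noise)

if __name__ == '__main__':
    xs = laplacian_samples(100_000)
    print(f"bfloat16-style SQNR: {sqnr_db(xs, bfloat16_round):.1f} dB")
```

Because rounding error in an FP format is roughly proportional to the magnitude of the value, the measured SQNR stays nearly constant over a wide range of variances, which is the variance-robustness property the abstract attributes to formats with an 8-bit exponent.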


Biographies

Perić Zoran
zoran.peric@elfak.ni.ac.rs

Z. Perić was born in Niš, Serbia, in 1964. He received the BS, MS and PhD degrees from the Faculty of Electronic Engineering, University of Niš, Serbia, in 1989, 1994 and 1999, respectively. He is a full professor at the Department of Telecommunications, Faculty of Electronic Engineering, University of Niš. His current research interests include information theory and signal processing. He is an author and co-author of over 350 papers. Dr. Perić has been a reviewer for a number of journals, including IEEE Transactions on Information Theory, IEEE Transactions on Signal Processing, IEEE Transactions on Communications, Compel, Informatica, Information Technology and Control, Expert Systems with Applications and Digital Signal Processing.

Denić Bojan
bojan.denic@elfak.ni.ac.rs

B. Denić received his PhD degree in the field of Telecommunications in 2023 from the Faculty of Electronic Engineering, University of Niš, Serbia. Currently, he is working as a research associate at the same faculty. His main research interests include signal processing, quantization and machine learning. He is an author of 34 scientific papers (18 of them in peer-reviewed international journals).

Dinčić Milan
milan.dincic@elfak.ni.ac.rs

M. Dinčić received his MSc in 2007, a PhD in Telecommunications in 2012 and a PhD in Measurements in 2017 from the University of Niš. Currently, he is working as an associate professor at the Faculty of Electronic Engineering. He is an author of 64 scientific papers (35 of them in reputable international journals with IF from the SCI/SCIe list). His research is related to quantization and compression of neural networks, sensors and measurement systems.

Perić Sofija
sofija.peric@elfak.ni.ac.rs




Copyright
© 2025 Vilnius University
Open access article under the CC BY license.

Keywords
floating-point format, piecewise uniform quantization, μ-law companding quantization, Laplacian source

Funding
This work was supported by the Ministry of Science, Technological Development and Innovation of the Republic of Serbia [grant number 451-03-65/2024-03/200102], as well as by the European Union’s Horizon 2023 research and innovation programme through the AIDA4Edge Twinning project (grant ID 101160293).


INFORMATICA

  • Online ISSN: 1822-8844
  • Print ISSN: 0868-4952