Informatica
An Approximate Closed-Form Expression for Calculating Performance of Floating-Point Format for the Laplacian Source
Volume 36, Issue 1 (2025), pp. 125–140
Zoran Perić   Bojan Denić   Milan Dinčić   Sofija Perić  

https://doi.org/10.15388/25-INFOR587
Pub. online: 18 March 2025      Type: Research Article      Open Access

Received
1 October 2024
Accepted
1 March 2025
Published
18 March 2025

Abstract

This paper introduces a novel approach that bridges the floating-point (FP) format, widely used across diverse fields for data representation, with the μ-law companding quantizer. It proposes a method for designing and linearizing the μ-law companding quantizer to yield a piecewise uniform quantizer tailored to the FP format. A key outcome of the paper is a closed-form approximate expression for accurately and efficiently evaluating the FP format’s performance for data with the Laplacian distribution. The expression is general across bit rates and data variances, and it markedly reduces the computational complexity of FP performance evaluation compared to prior methods that rely on summing a large number of terms. By facilitating the evaluation of FP format performance, this research substantially aids the selection of the optimal bit rate, which is crucial for digital representation quality, dynamic range, computational overhead, and energy efficiency. Numerical calculations spanning a wide range of data variances, carried out for several commonly used FP versions with an 8-bit exponent, demonstrate that the proposed closed-form expression closely approximates the FP format’s performance.
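The performance measure discussed in the abstract (how faithfully an FP format with an 8-bit exponent represents Laplacian-distributed data) can also be estimated by brute-force simulation, which is the kind of expensive evaluation the paper's closed-form expression is meant to avoid. The sketch below is an illustration of that baseline, not the paper's method: it rounds unit-variance Laplacian samples to a bfloat16-style format (8-bit exponent, 7 stored mantissa bits, one of the commonly used FP versions the abstract mentions) and measures the resulting signal-to-quantization-noise ratio. All function names here are illustrative, not from the article.

```python
import math
import random
import struct

def bfloat16_round(x: float) -> float:
    # Reinterpret x as IEEE 754 binary32 bits, then keep sign, the 8-bit
    # exponent, and the top 7 mantissa bits; adding 0x8000 before masking
    # implements round-to-nearest on the discarded low 16 bits.
    bits = struct.unpack('>I', struct.pack('>f', x))[0]
    bits = (bits + 0x8000) & 0xFFFF0000
    return struct.unpack('>f', struct.pack('>I', bits))[0]

def laplacian_samples(n: int, sigma: float = 1.0, seed: int = 0) -> list:
    # A Laplacian variable with variance sigma^2 has scale b = sigma/sqrt(2);
    # the difference of two i.i.d. exponentials with that scale is Laplacian.
    rng = random.Random(seed)
    b = sigma / math.sqrt(2.0)
    return [rng.expovariate(1.0 / b) - rng.expovariate(1.0 / b)
            for _ in range(n)]

def sqnr_db(samples, quantize) -> float:
    # SQNR = 10 log10( signal power / quantization-noise power ).
    signal = sum(x * x for x in samples)
    noise = sum((x - quantize(x)) ** 2 for x in samples)
    return 10.0 * math.log10(signal / noise)

if __name__ == '__main__':
    xs = laplacian_samples(100_000)
    print(f"bfloat16-style SQNR: {sqnr_db(xs, bfloat16_round):.1f} dB")
```

Because rounding error in an FP format is roughly proportional to the magnitude of the value, the measured SQNR stays nearly constant over a wide range of variances, which is the variance-robustness property the abstract attributes to formats with an 8-bit exponent.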


Biographies

Perić Zoran
zoran.peric@elfak.ni.ac.rs

Z. Perić was born in Niš, Serbia, in 1964. He received the BS, MS and PhD degrees from the Faculty of Electronic Engineering, University of Niš, Serbia, in 1989, 1994 and 1999, respectively. He is a full professor at the Department of Telecommunications, Faculty of Electronic Engineering, University of Niš. His current research interests include information theory and signal processing. He is an author and co-author of over 350 papers. Dr. Perić has been a reviewer for a number of journals, including IEEE Transactions on Information Theory, IEEE Transactions on Signal Processing, IEEE Transactions on Communications, Compel, Informatica, Information Technology and Control, Expert Systems with Applications and Digital Signal Processing.

Denić Bojan
bojan.denic@elfak.ni.ac.rs

B. Denić received his PhD degree in the field of Telecommunications in 2023 from the Faculty of Electronic Engineering, University of Niš, Serbia. Currently, he is working as a research associate at the same faculty. His main research interests include signal processing, quantization and machine learning. He is an author of 34 scientific papers (18 of them in peer-reviewed international journals).

Dinčić Milan
milan.dincic@elfak.ni.ac.rs

M. Dinčić received his MSc in 2007, a PhD in Telecommunications in 2012 and a PhD in Measurements in 2017 from the University of Niš. Currently, he is working as an associate professor at the Faculty of Electronic Engineering. He is an author of 64 scientific papers (35 of them in reputable international journals with IF from the SCI/SCIe list). His research is related to quantization and compression of neural networks, sensors and measurement systems.

Perić Sofija
sofija.peric@elfak.ni.ac.rs




Copyright
© 2025 Vilnius University
Open access article under the CC BY license.

Keywords
floating-point format, piecewise uniform quantization, μ-law companding quantization, Laplacian source

Funding
This work was supported by the Ministry of Science, Technological Development and Innovation of the Republic of Serbia [grant number 451-03-65/2024-03/200102], as well as by the European Union’s Horizon 2023 research and innovation programme through the AIDA4Edge Twinning project (grant ID 101160293).


INFORMATICA

  • Online ISSN: 1822-8844
  • Print ISSN: 0868-4952