Increased computational performance for vector operations on BLAS-1

Authors

  • José Antonio Muñoz Gómez, Universidad de Guadalajara, México
  • Abimael Jiménez Pérez, Universidad Autónoma de Ciudad Juárez, México
  • Gustavo Rodríguez Gómez, Instituto Nacional de Astrofísica, Óptica y Electrónica, México

Keywords

Scientific computing, BLAS-1, unroll technique, vector programming

Abstract

The Basic Linear Algebra Subprograms library (BLAS-1) is considered the programming standard in scientific computing. In this work, we analyze several code optimization techniques to increase the computational performance of BLAS-1. In particular, we follow a combinational approach that explores coding alternatives based on the unroll technique at different depth levels and on vector data programming with MMX and SSE for Intel processors. Using the main BLAS-1 functions, we measured a performance increase, expressed in megaflops, of up to 52% compared to the optimized BLAS-1 of the ATLAS library.
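To make the two optimization ideas in the abstract concrete, the sketch below shows a double-precision AXPY kernel (y := a*x + y, one of the BLAS-1 routines) written three ways: a plain loop, a 4-way unrolled loop, and an SSE2-vectorized loop. This is a minimal illustration assuming a C implementation with SSE intrinsics; it is not the authors' code, and names such as daxpy_unroll4 are hypothetical. Performance of such kernels is typically reported in megaflops, i.e. the number of floating-point operations executed (2n per AXPY call) divided by elapsed seconds and by 10^6.

/* Minimal sketch (not the authors' implementation): AXPY written
 * with the two techniques discussed in the paper.
 * Compile on x86-64, e.g.: gcc -O2 -msse2 axpy_sketch.c */
#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stddef.h>

/* Reference version: one element per iteration. */
void daxpy_plain(size_t n, double a, const double *x, double *y) {
    for (size_t i = 0; i < n; ++i)
        y[i] += a * x[i];
}

/* Unrolled version, depth 4: fewer loop-control instructions and
 * more independent operations exposed per iteration. */
void daxpy_unroll4(size_t n, double a, const double *x, double *y) {
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        y[i]     += a * x[i];
        y[i + 1] += a * x[i + 1];
        y[i + 2] += a * x[i + 2];
        y[i + 3] += a * x[i + 3];
    }
    for (; i < n; ++i)   /* remainder */
        y[i] += a * x[i];
}

/* SSE2 version: two doubles processed per 128-bit register. */
void daxpy_sse2(size_t n, double a, const double *x, double *y) {
    __m128d va = _mm_set1_pd(a);
    size_t i = 0;
    for (; i + 2 <= n; i += 2) {
        __m128d vx = _mm_loadu_pd(x + i);
        __m128d vy = _mm_loadu_pd(y + i);
        vy = _mm_add_pd(vy, _mm_mul_pd(va, vx));
        _mm_storeu_pd(y + i, vy);
    }
    for (; i < n; ++i)   /* remainder */
        y[i] += a * x[i];
}

The unrolling depth (4 here) and the vector width (2 doubles per SSE register) are the kinds of parameters the combinational approach described in the abstract would vary and benchmark.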


Author Biographies

José Antonio Muñoz Gómez, Universidad de Guadalajara, México

Departamento de Ingeniería, Universidad de Guadalajara, Jalisco

Abimael Jiménez Pérez, Universidad Autónoma de Ciudad Juárez, México

Departamento de Electrónica y Computación

Gustavo Rodríguez Gómez, Instituto Nacional de Astrofísica, Óptica y Electrónica, México

Departamento de Ciencias Computacionales


Published

2015-01-31

How to Cite

[1] J. A. Muñoz Gómez, A. Jiménez Pérez, and G. Rodríguez Gómez, “Increased computational performance for vector operations on BLAS-1”, Publ.Cienc.Tecnol, vol. 8, no. 1, pp. 31-44, Jan. 2015.

Issue

Section

Research Article