    • autopin – Automated Optimization of Thread-to-Core Pinning on Multicore Systems

      Klug, Tobias; Ott, Michael; Weidendorfer, Josef; Trinitis, Carsten (Springer, 2011)
      In this paper we present a framework for automatic detection and application of the best binding between threads of a running parallel application and processor cores in a shared memory system, by making use of hardware performance counters. This is especially important within the scope of multicore architectures with shared cache levels. We demonstrate that many applications from the SPEC OMP benchmark show quite sensitive runtime behavior depending on the thread/core binding used. In our tests, the proposed framework is able to find the best binding in nearly all cases. The proposed framework is intended to supplement job scheduling systems for better automatic exploitation of systems with multicore processors, as well as making programmers aware of this issue by providing measurement logs.
    • Off-loading application controlled data prefetching in numerical codes for multi-core processors

      Trinitis, Carsten; Weidendorfer, Josef (Inderscience, 2008-11)
      An important issue when designing numerical code in High Performance Computing is cache optimisation in order to exploit the performance potential of a given target architecture. This includes techniques to improve memory access locality as well as prefetching. Inherent algorithm constraints often limit the first approach, which typically uses a blocking technique. While there exist automatic prefetching mechanisms in hardware and/or compilers, they cannot complement blocking with additional prefetching. We provide an infrastructure for off-loading application controlled prefetching on a chip multiprocessor, which allows further improvement of numerical code already optimised by standard cache optimisation. Clear benefits are shown for real workloads on existing hardware.
    • Parallel MLEM on multicore architectures

      Küstner, Tilman; Weidendorfer, Josef; Schirmer, Jasmine; Klug, Tobias; Trinitis, Carsten; Ziegler, Sybille; Technische Universität München (Springer, 2009)
      The efficient use of multicore architectures for sparse matrix-vector multiplication (SpMV) is currently an open challenge. One algorithm which makes use of SpMV is the maximum likelihood expectation maximization (MLEM) algorithm. When using MLEM for positron emission tomography (PET) image reconstruction, one requires a particularly large matrix. We present a new storage scheme for this type of matrix which cuts the memory requirements by half, compared to the widely used compressed sparse row format. For parallelization we combine two partitioning techniques, recursive bisection and striping. Our results show good load balancing and cache behavior. We also give speedup measurements on various modern multicore systems.
    • Sparse matrix operations on several multi-core architectures

      Trinitis, Carsten; Küstner, Tilman; Weidendorfer, Josef; Smajic, Jasmin (SpringerLink, 2010)
      This paper compares various contemporary multicore-based microprocessor architectures from different vendors with different memory interconnects regarding performance, speedup, and parallel efficiency. Sparse matrix decomposition is used as a benchmark application. The example matrix used in the experiments comes from an electrical engineering application, where numerical simulation of physical processes plays an important role in the design of industrial products. Within this context, thread-to-core pinning and cache optimization are two important aspects which are investigated in more detail.