Source author record

Tobias Grosser

Tobias Grosser appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

ResearcherUnclaimed source record

Programming Languages Distributed, Parallel, and Cluster Computing Performance Emerging Technologies Mathematical Software

Catalog footprint

What is connected

8works

5topics

4close collaborators

Actions

Connect this record

Open graph Browse works

Inspect adjacent papers, topics, institutions and collaborators without losing the researcher page.

Building this map preview

BZPEER is loading the nearby papers, people, topics and institutions for this page.

preprint2022arXiv

Lambda the Ultimate SSA: Optimizing Functional Programs in SSA

Static Single Assignment (SSA) is the workhorse of modern optimizing compilers for imperative programming languages. However, functional languages have been slow to adopt SSA and prefer to use intermediate representations based on minimal lambda calculi due to SSA's inability to express higher order constructs. We exploit a new SSA construct -- regions -- in order to express functional optimizations via classical SSA based reasoning. Region optimization currently relies on ad-hoc analyses and transformations on imperative programs. These ad-hoc transformations are sufficient for imperative languages as regions are used in a limited fashion. In contrast, we use regions pervasively to model sub-expressions in our functional IR. This motivates us to systematize region optimizations. We extend classical SSA reasoning to regions for functional-style analyses and transformations. We implement a new SSA+regions based backend for LEAN4, a theorem prover that implements a purely functional, dependently typed programming language. Our backend is feature-complete and handles all constructs of LEAN4's functional intermediate representation λrc within the SSA framework. We evaluate our proposed region optimizations by optimizing λrc within an SSA+regions based framework implemented in MLIR and demonstrating performance parity with the current LEAN4 backend. We believe our work will pave the way for a unified optimization framework capable of representing, analyzing, and optimizing both functional and imperative languages.

preprint2022arXiv

MOM: Matrix Operations in MLIR

Modern research in code generators for dense linear algebra computations has shown the ability to produce optimized code with a performance which compares and often exceeds the one of state-of-the-art implementations by domain experts. However, the underlying infrastructure is often developed in isolation making the interconnection of logically combinable systems complicated if not impossible. In this paper, we propose to leverage MLIR as a unifying compiler infrastructure for the optimization of dense linear algebra operations. We propose a new MLIR dialect for expressing linear algebraic computations including matrix properties to enable high-level algorithmic transformations. The integration of this new dialect in MLIR enables end-to-end compilation of matrix computations via conversion to existing lower-level dialects already provided by the framework.

preprint2020arXiv

A Fast Analytical Model of Fully Associative Caches

While the cost of computation is an easy to understand local property, the cost of data movement on cached architectures depends on global state, does not compose, and is hard to predict. As a result, programmers often fail to consider the cost of data movement. Existing cache models and simulators provide the missing information but are computationally expensive. We present a lightweight cache model for fully associative caches with least recently used (LRU) replacement policy that gives fast and accurate results. We count the cache misses without explicit enumeration of all memory accesses by using symbolic counting techniques twice: 1) to derive the stack distance for each memory access and 2) to count the memory accesses with stack distance larger than the cache size. While this technique seems infeasible in theory, due to non-linearities after the first round of counting, we show that the counting problems are sufficiently linear in practice. Our cache model often computes the results within seconds and contrary to simulation the execution time is mostly problem size independent. Our evaluation measures modeling errors below 0.6% on real hardware. By providing accurate data placement information we enable memory hierarchy aware software development.

preprint2020arXiv

Compiling Neural Networks for a Computational Memory Accelerator

Computational memory (CM) is a promising approach for accelerating inference on neural networks (NN) by using enhanced memories that, in addition to storing data, allow computations on them. One of the main challenges of this approach is defining a hardware/software interface that allows a compiler to map NN models for efficient execution on the underlying CM accelerator. This is a non-trivial task because efficiency dictates that the CM accelerator is explicitly programmed as a dataflow engine where the execution of the different NN layers form a pipeline. In this paper, we present our work towards a software stack for executing ML models on such a multi-core CM accelerator. We describe an architecture for the hardware and software, and focus on the problem of implementing the appropriate control logic so that data dependencies are respected. We propose a solution to the latter that is based on polyhedral compilation.

preprint2020arXiv

Domain-Specific Multi-Level IR Rewriting for GPU

Traditional compilers operate on a single generic intermediate representation (IR). These IRs are usually low-level and close to machine instructions. As a result, optimizations relying on domain-specific information are either not possible or require complex analysis to recover the missing information. In contrast, multi-level rewriting instantiates a hierarchy of dialects (IRs), lowers programs level-by-level, and performs code transformations at the most suitable level. We demonstrate the effectiveness of this approach for the weather and climate domain. In particular, we develop a prototype compiler and design stencil- and GPU-specific dialects based on a set of newly introduced design principles. We find that two domain-specific optimizations (500 lines of code) realized on top of LLVM's extensible MLIR compiler infrastructure suffice to outperform state-of-the-art solutions. In essence, multi-level rewriting promises to herald the age of specialized compilers composed from domain- and target-specific dialects implemented on top of a shared infrastructure.

preprint2020arXiv

Extracting Clean Performance Models from Tainted Programs

Performance models are well-known instruments to understand the scaling behavior of parallel applications. They express how performance changes as key execution parameters, such as the number of processes or the size of the input problem, vary. Besides reasoning about program behavior, such models can also be automatically derived from performance data. This is called empirical performance modeling. While this sounds simple at the first glance, this approach faces several serious interrelated challenges, including expensive performance measurements, inaccuracies inflicted by noisy benchmark data, and overall complex experiment design, starting with the selection of the right parameters. The more parameters one considers, the more experiments are needed and the stronger the impact of noise. In this paper, we show how taint analysis, a technique borrowed from the domain of computer security, can substantially improve the modeling process, lowering its cost, improving model quality, and help validate performance models and experimental setups.

preprint2020arXiv

LLHD: A Multi-level Intermediate Representation for Hardware Description Languages

Modern Hardware Description Languages (HDLs) such as SystemVerilog or VHDL are, due to their sheer complexity, insufficient to transport designs through modern circuit design flows. Instead, each design automation tool lowers HDLs to its own Intermediate Representation (IR). These tools are monolithic and mostly proprietary, disagree in their implementation of HDLs, and while many redundant IRs exists, no IR today can be used through the entire circuit design flow. To solve this problem, we propose the LLHD multi-level IR. LLHD is designed as simple, unambiguous reference description of a digital circuit, yet fully captures existing HDLs. We show this with our reference compiler on designs as complex as full CPU cores. LLHD comes with lowering passes to a hardware-near structural IR, which readily integrates with existing tools. LLHD establishes the basis for innovation in HDLs and tools without redundant compilers or disjoint IRs. For instance, we implement an LLHD simulator that runs up to 2.4x faster than commercial simulators but produces equivalent, cycle-accurate results. An initial vertically-integrated research prototype is capable of representing all levels of the IR, implements lowering from the behavioural to the structural IR, and covers a sufficient subset of SystemVerilog to support a full CPU design.

preprint2013arXiv

PENCIL: Towards a Platform-Neutral Compute Intermediate Language for DSLs

We motivate the design and implementation of a platform-neutral compute intermediate language (PENCIL) for productive and performance-portable accelerator programming.

Tobias Grosser

What is connected

Connect this record

See the researcher in context

Building this map preview

8 published item(s)

Lambda the Ultimate SSA: Optimizing Functional Programs in SSA

MOM: Matrix Operations in MLIR

A Fast Analytical Model of Fully Associative Caches

Compiling Neural Networks for a Computational Memory Accelerator

Domain-Specific Multi-Level IR Rewriting for GPU

Extracting Clean Performance Models from Tainted Programs

LLHD: A Multi-level Intermediate Representation for Hardware Description Languages

PENCIL: Towards a Platform-Neutral Compute Intermediate Language for DSLs