Source author record

Jürgen Teich

Jürgen Teich appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

ResearcherUnclaimed source record

Hardware Architecture Distributed, Parallel, and Cluster Computing Programming Languages Computer Vision Cryptography and Security Operating Systems Databases eess.IV Robotics Software Engineering

Catalog footprint

What is connected

14works

10topics

4close collaborators

Actions

Connect this record

Open graph Browse works

Inspect adjacent papers, topics, institutions and collaborators without losing the researcher page.

Building this map preview

BZPEER is loading the nearby papers, people, topics and institutions for this page.

preprint2022arXiv

Efficient Table-based Function Approximation on FPGAs using Interval Splitting and BRAM Instantiation

This paper proposes a novel approach for the generation of memory-efficient table-based function approximation circuits for FPGAs. Given a function f(x) to be approximated in a given interval [x0,x0+a] and a maximum approximation error Ea, the goal is to determine a function table implementation with a minimized memory footprint, i.e., number of entries that need to be stored. Rather than state-of-the-art work performing an even sampling of the given interval by so-called breakpoints and using linear interpolation between two adjacent breakpoints to determine f(x) at the maximum error bound, first, we propose three interval-splitting algorithms to reduce the required memory footprint drastically based on the observation that in sub-intervals of low gradient, a coarser sampling grid may be assumed to satisfy the maximum interpolation error bound. Experiments on elementary mathematical functions show that a large fraction in memory footprint may be saved. Second, a hardware architecture implementing the sub-interval selection, breakpoint lookup and interpolation at a latency of just 9 clock cycles is introduced. Third, within each generated circuit design, BRAMs are automatically instantiated rather than synthesizing the reduced footprint function table using LUT primitives providing an additional degree of resource efficiency.

preprint2022arXiv

Multi-Objective Design Space Exploration for the Optimization of the HEVC Mode Decision Process

Finding the best possible encoding decisions for compressing a video sequence is a highly complex problem. In this work, we propose a multi-objective Design Space Exploration (DSE) method to automatically find HEVC encoder implementations that are optimized for several different criteria. The DSE shall optimize the coding mode evaluation order of the mode decision process and jointly explore early skip conditions to minimize the four objectives a) bitrate, b) distortion, c) encoding time, and d) decoding energy. In this context, we use a SystemC-based actor model of the HM test model encoder for the evaluation of each explored solution. The evaluation that is based on real measurements shows that our framework can automatically generate encoder solutions that save more than 60% of encoding time or 3% of decoding energy when accepting bitrate increases of around 3%.

preprint2022arXiv

Raw Filtering of JSON Data on FPGAs

Many Big Data applications include the processing of data streams on semi-structured data formats such as JSON. A disadvantage of such formats is that an application may spend a significant amount of processing time just on unselectively parsing all data. To relax this issue, the concept of raw filtering is proposed with the idea to remove data from a stream prior to the costly parsing stage. However, as accurate filtering of raw data is often only possible after the data has been parsed, raw filters are designed to be approximate in the sense of allowing false-positives in order to be implemented efficiently. Contrary to previously proposed CPU-based raw filtering techniques that are restricted to string matching, we present FPGA-based primitives for filtering strings, numbers and also number ranges. In addition, a primitive respecting the basic structure of JSON data is proposed that can be used to further increase the accuracy of introduced raw filters. The proposed raw filter primitives are designed to allow for their composition according to a given filter expression of a query. Thus, complex raw filters can be created for FPGAs which enable a drastical decrease in the amount of generated false-positives, particularly for IoT workload. As there exists a trade-off between accuracy and resource consumption, we evaluate primitives as well as composed raw filters using different queries from the RiotBench benchmark. Our results show that up to 94.3% of the raw data can be filtered without producing any observed false-positives using only a few hundred LUTs.

preprint2022arXiv

Real-Time Waveform Matching with a Digitizer at 10 GS/s

Side-Channel Analysis (SCA) requires the detection of the specific time frame Cryptographic Operations (COs) takeplace in the side-channel signal. In laboratory conditions with full control over the Device under Test (DuT), dedicated trigger signals can be implemented to indicate the start and end of COs. For real-world scenarios, waveform-matching techniques have been established which compare the side-channel signal with a template of the CO's pattern in real time to detect the CO in the side channel. State-of-the-art approaches are implemented on Field-Programmable Gate Arrays (FPGAs). However, current waveform-matching designs are processing the samples from Analog-to-Digital Converters (ADCs) sequentially and can only work with low sampling rates due to the limited clock speed of FPGAs. This makes it increasingly difficult to apply existing techniques on modern DuTs that are operating with clock speeds in the GHz range. In this paper, we present a parallel waveform-matching architecture that is capable of performing waveform matching at the speed of fast ADCs. We implement the proposed architecture in a high-end FPGA-based digitizer and apply it to detect AES COs from the side channel of a single-board computer operating at 1 GHz. Our implementation allows for waveform matching at 10 GS/s with high accuracy, thus offering a speedup of 50x compared to the fastest state-of-the-art implementation known to us.

preprint2021arXiv

Symbolic Loop Compilation for Tightly Coupled Processor Arrays

Loop compilation for Tightly Coupled Processor Arrays (TCPAs), a class of massively parallel loop accelerators, entails solving NP-hard problems, yet depends on the loop bounds and number of available processing elements (PEs), parameters known only at runtime because of dynamic resource management and input sizes. Therefore, this article proposes a two-phase approach called symbolic loop compilation: At compile time, the necessary NP-complete problems are solved and the solutions compiled into a space-efficient symbolic configuration. At runtime, a concrete configuration is generated from the symbolic configuration according to the parameters values. We show that the latter phase, called instantiation, runs in polynomial time with its most complex step, program instantiation, not depending on the number of PEs. As validation, we performed symbolic loop compilation on real-world loops and measured time and space requirements. Our experiments confirm that a symbolic configuration is space-efficient and suited for systems with little memory -- often, a symbolic configuration is smaller than a single concrete configuration -- and that program instantiation scales well with the number of PEs -- for example, when instantiating a symbolic configuration of a matrix-matrix multiplication, the execution time is similar for $4\times 4$ and $32\times 32$ PEs.

preprint2020arXiv

AnyHLS: High-Level Synthesis with Partial Evaluation

FPGAs excel in low power and high throughput computations, but they are challenging to program. Traditionally, developers rely on hardware description languages like Verilog or VHDL to specify the hardware behavior at the register-transfer level. High-Level Synthesis (HLS) raises the level of abstraction, but still requires FPGA design knowledge. Programmers usually write pragma-annotated C/C++ programs to define the hardware architecture of an application. However, each hardware vendor extends its own C dialect using its own vendor-specific set of pragmas. This prevents portability across different vendors. Furthermore, pragmas are not first-class citizens in the language. This makes it hard to use them in a modular way or design proper abstractions. In this paper, we present AnyHLS, an approach to synthesize FPGA designs in a modular and abstract way. AnyHLS is able to raise the abstraction level of existing HLS tools by resorting to programming language features such as types and higher-order functions as follows: It relies on partial evaluation to specialize and to optimize the user application based on a library of abstractions. Then, vendor-specific HLS code is generated for Intel and Xilinx FPGAs. Portability is obtained by avoiding any vendor-specific pragmas at the source code. In order to validate achievable gains in productivity, a library for the domain of image processing is introduced as a case study, and its synthesis results are compared with several state-of-theart Domain-Specific Language (DSL) approaches for this domain.

preprint2020arXiv

HipaccVX: Wedding of OpenVX and DSL-based Code Generation

Writing programs for heterogeneous platforms optimized for high performance is hard since this requires the code to be tuned at a low level with architecture-specific optimizations that are most times based on fundamentally differing programming paradigms and languages. OpenVX promises to solve this issue for computer vision applications with a royalty-free industry standard that is based on a graph-execution model. Yet, the OpenVX' algorithm space is constrained to a small set of vision functions. This hinders accelerating computations that are not included in the standard. In this paper, we analyze OpenVX vision functions to find an orthogonal set of computational abstractions. Based on these abstractions, we couple an existing Domain-Specific Language (DSL) back end to the OpenVX environment and provide language constructs to the programmer for the definition of user-defined nodes. In this way, we enable optimizations that are not possible to detect with OpenVX graph implementations using the standard computer vision functions. These optimizations can double the throughput on an Nvidia GTX GPU and decrease the resource usage of a Xilinx Zynq FPGA by 50% for our benchmarks. Finally, we show that our proposed compiler framework, called HipaccVX, can achieve better results than the state-of-the-art approaches Nvidia VisionWorks and Halide-HLS.

preprint2020arXiv

Isolation-Aware Timing Analysis and Design Space Exploration for Predictable and Composable Many-Core Systems

Composable many-core systems enable the independent development and analysis of applications which will be executed on a shared platform where the mix of concurrently executed applications may change dynamically at run time. For each individual application, an off-line Design Space Exploration (DSE) is performed to compute several mapping alternatives on the platform, offering Pareto-optimal trade-offs in terms of real-time guarantees, resource usage, etc. At run time, one mapping is then chosen to launch the application on demand. In this context, to enable an independent analysis of each individual application at design time, so-called inter-application isolation schemes are applied which specify temporal or spatial isolation policies between applications. S.o.t.a. composable many-core systems are developed based on a fixed isolation scheme that is exclusively applied to every resource in every mapping of every application and use a timing analysis tailored to that isolation scheme to derive timing guarantees for each mapping. A fixed isolation scheme, however, heavily restricts the explored space of solutions and can, therefore, lead to suboptimality. Lifting this restriction necessitates a timing analysis that is applicable to mappings with an arbitrary mix of isolation schemes on different resources. To address this issue, we present an isolation-aware timing analysis that unlike existing analyses can handle multiple isolation schemes in combination within one mapping and delivers safe yet tight timing bounds by identifying and excluding interference scenarios that can never happen under the given combination of isolation schemes. Based on the timing analysis, we present a DSE which explores the choices of isolation scheme per resource within each mapping. Experimental results demonstrate the advantage of the proposed approach over approaches based on a fixed isolation scheme.

preprint2020arXiv

Secure Boot from Non-Volatile Memory for Programmable SoC Architectures

In modern embedded systems, the trust in comprehensive security standards all along the product life cycle has become an increasingly important access-to-market requirement. However, these security standards rely on mandatory immunity assumptions such as the integrity and authenticity of an initial system configuration typically loaded from Non-Volatile Memory (NVM). This applies especially to FPGA-based Programmable System-on-Chip (PSoC) architectures, since object codes as well as configuration data easily exceed the capacity of a secure bootROM. In this context, an attacker could try to alter the content of the NVM device in order to manipulate the system. The PSoC therefore relies on the integrity of the NVM particularly at boot-time. In this paper, we propose a methodology for securely booting from an NVM in a potentially unsecure environment by exploiting the reconfigurable logic of the FPGA. Here, the FPGA serves as a secure anchor point by performing required integrity and authenticity verifications prior to the configuration and execution of any user application loaded from the NVM on the PSoC. The proposed secure boot process is based on the following assumptions and steps: 1) The boot configurationis stored on a fully encrypted Secure Digital memory card (SD card) or alternatively Flash acting as NVM. 2) At boot time, a hardware design called Trusted Memory-Interface Unit (TMIU) is loaded to verify first the authenticity of the deployed NVM and then after decryption the integrity of its content. To demonstrate the practicability of our approach, we integrated the methodology into the vendor-specific secure boot process of a Xilinx Zynq PSoC and evaluated the design objectives performance, power and resource costs.

preprint2015arXiv

Automatic Optimization of Hardware Accelerators for Image Processing

In the domain of image processing, often real-time constraints are required. In particular, in safety-critical applications, such as X-ray computed tomography in medical imaging or advanced driver assistance systems in the automotive domain, timing is of utmost importance. A common approach to maintain real-time capabilities of compute-intensive applications is to offload those computations to dedicated accelerator hardware, such as Field Programmable Gate Arrays (FPGAs). Programming such architectures is a challenging task, with respect to the typical FPGA-specific design criteria: Achievable overall algorithm latency and resource usage of FPGA primitives (BRAM, FF, LUT, and DSP). High-Level Synthesis (HLS) dramatically simplifies this task by enabling the description of algorithms in well-known higher languages (C/C++) and its automatic synthesis that can be accomplished by HLS tools. However, algorithm developers still need expert knowledge about the target architecture, in order to achieve satisfying results. Therefore, in previous work, we have shown that elevating the description of image algorithms to an even higher abstraction level, by using a Domain-Specific Language (DSL), can significantly cut down the complexity for designing such algorithms for FPGAs. To give the developer even more control over the common trade-off, latency vs. resource usage, we will present an automatic optimization process where these criteria are analyzed and fed back to the DSL compiler, in order to generate code that is closer to the desired design specifications. Finally, we generate code for stereo block matching algorithms and compare it with handwritten implementations to quantify the quality of our results.

preprint2014arXiv

Code Generation for High-Level Synthesis of Multiresolution Applications on FPGAs

Multiresolution Analysis (MRA) is a mathematical method that is based on working on a problem at different scales. One of its applications is medical imaging where processing at multiple scales, based on the concept of Gaussian and Laplacian image pyramids, is a well-known technique. It is often applied to reduce noise while preserving image detail on different levels of granularity without modifying the filter kernel. In scientific computing, multigrid methods are a popular choice, as they are asymptotically optimal solvers for elliptic Partial Differential Equations (PDEs). As such algorithms have a very high computational complexity that would overwhelm CPUs in the presence of real-time constraints, application-specific processors come into consideration for implementation. Despite of huge advancements in leveraging productivity in the respective fields, designers are still required to have detailed knowledge about coding techniques and the targeted architecture to achieve efficient solutions. Recently, the HIPAcc framework was proposed as a means for automatic code generation of image processing algorithms, based on a Domain-Specific Language (DSL). From the same code base, it is possible to generate code for efficient implementations on several accelerator technologies including different types of Graphics Processing Units (GPUs) as well as reconfigurable logic (FPGAs). In this work, we demonstrate the ability of HIPAcc to generate code for the implementation of multiresolution applications on FPGAs and embedded GPUs.

preprint2014arXiv

Massively Parallel Processor Architectures for Resource-aware Computing

We present a class of massively parallel processor architectures called invasive tightly coupled processor arrays (TCPAs). The presented processor class is a highly parameterizable template, which can be tailored before runtime to fulfill costumers' requirements such as performance, area cost, and energy efficiency. These programmable accelerators are well suited for domain-specific computing from the areas of signal, image, and video processing as well as other streaming processing applications. To overcome future scaling issues (e.g., power consumption, reliability, resource management, as well as application parallelization and mapping), TCPAs are inherently designed in a way to support self-adaptivity and resource awareness at hardware level. Here, we follow a recently introduced resource-aware parallel computing paradigm called invasive computing where an application can dynamically claim, execute, and release resources. Furthermore, we show how invasive computing can be used as an enabler for power management. Finally, we will introduce ideas on how to realize fault-tolerant loop execution on such massively parallel architectures through employing on-demand spatial redundancies at the processor array level.

preprint2014arXiv

Proceedings of the First Workshop on Resource Awareness and Adaptivity in Multi-Core Computing (Racing 2014)

This volume contains the papers accepted at the 1st Workshop on Resource Awareness and Adaptivity in Multi-Core Computing (Racing 2014), held in Paderborn, Germany, May 29-30, 2014. Racing 2014 was co-located with the IEEE European Test Symposium (ETS).

preprint2013arXiv

Invasive Computing - Common Terms and Granularity of Invasion

Future MPSoCs with 1000 or more processor cores on a chip require new means for resource-aware programming in order to deal with increasing imperfections such as process variation, fault rates, aging effects, and power as well as thermal problems. On the other hand, predictable program executions are threatened if not impossible if no proper means of resource isolation and exclusive use may be established on demand. In view of these problems and menaces, invasive computing enables an application programmer to claim for processing resources and spread computations to claimed processors dynamically at certain points of the program execution. Such decisions may be depending on the degree of application parallelism and the state of the underlying resources such as utilization, load, and temperature, but also with the goal to provide predictable program execution on MPSoCs by claiming processing resources exclusively as the default and thus eliminating interferences and creating the necessary isolation between multiple concurrently running applications. For achieving this goal, invasive computing introduces new programming constructs for resource-aware programming that meanwhile, for testing purpose, have been embedded into the parallel computing language X10 as developed by IBM using a library-based approach. This paper presents major ideas and common terms of invasive computing as investigated by the DFG Transregional Collaborative Research Centre TR89. Moreoever, a reflection is given on the granularity of resources that may be requested by invasive programs.

Jürgen Teich

What is connected

Connect this record

See the researcher in context

Building this map preview

14 published item(s)

Efficient Table-based Function Approximation on FPGAs using Interval Splitting and BRAM Instantiation

Multi-Objective Design Space Exploration for the Optimization of the HEVC Mode Decision Process

Raw Filtering of JSON Data on FPGAs

Real-Time Waveform Matching with a Digitizer at 10 GS/s

Symbolic Loop Compilation for Tightly Coupled Processor Arrays

AnyHLS: High-Level Synthesis with Partial Evaluation

HipaccVX: Wedding of OpenVX and DSL-based Code Generation

Isolation-Aware Timing Analysis and Design Space Exploration for Predictable and Composable Many-Core Systems

Secure Boot from Non-Volatile Memory for Programmable SoC Architectures

Automatic Optimization of Hardware Accelerators for Image Processing

Code Generation for High-Level Synthesis of Multiresolution Applications on FPGAs

Massively Parallel Processor Architectures for Resource-aware Computing

Proceedings of the First Workshop on Resource Awareness and Adaptivity in Multi-Core Computing (Racing 2014)

Invasive Computing - Common Terms and Granularity of Invasion