Researcher profile

Jürgen Teich

Jürgen Teich contributes to research discovery and scholarly infrastructure.

Published work

9 published items

preprint · 2022 · arXiv

Efficient Table-based Function Approximation on FPGAs using Interval Splitting and BRAM Instantiation

This paper proposes a novel approach for generating memory-efficient table-based function approximation circuits for FPGAs. Given a function f(x) to be approximated in a given interval [x0,x0+a] and a maximum approximation error Ea, the goal is to determine a function table implementation with a minimized memory footprint, i.e., number of entries that need to be stored. State-of-the-art work samples the given interval evenly at so-called breakpoints and uses linear interpolation between two adjacent breakpoints to determine f(x) within the maximum error bound. In contrast, we first propose three interval-splitting algorithms that drastically reduce the required memory footprint, based on the observation that in sub-intervals of low gradient, a coarser sampling grid suffices to satisfy the maximum interpolation error bound. Experiments on elementary mathematical functions show that a large fraction of the memory footprint can be saved. Second, a hardware architecture implementing the sub-interval selection, breakpoint lookup and interpolation at a latency of just 9 clock cycles is introduced. Third, within each generated circuit design, BRAMs are automatically instantiated rather than synthesizing the reduced-footprint function table using LUT primitives, providing an additional degree of resource efficiency.
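
The interval-splitting idea from this abstract can be illustrated with a minimal Python sketch. It is not one of the paper's three algorithms: the recursive halving strategy, the sampled error estimate, and all function names (`max_interp_error`, `entries_needed`, `split_entries`) are assumptions made for illustration only.

```python
import math

def max_interp_error(f, lo, hi, n_segments, probes=32):
    """Worst *sampled* error of linear interpolation on an even grid over [lo, hi]."""
    step = (hi - lo) / n_segments
    worst = 0.0
    for i in range(n_segments):
        a = lo + i * step
        fa, fb = f(a), f(a + step)
        for k in range(1, probes):
            t = k / probes
            approx = fa + t * (fb - fa)          # linear interpolation
            worst = max(worst, abs(f(a + t * step) - approx))
    return worst

def entries_needed(f, lo, hi, max_err):
    """Smallest number of evenly spaced breakpoints whose sampled error is <= max_err."""
    n = 1
    while max_interp_error(f, lo, hi, n) > max_err:
        n += 1
    return n + 1                                  # n segments need n + 1 breakpoints

def split_entries(f, lo, hi, max_err, depth):
    """Halve the interval recursively whenever that lowers the stored entry count.

    Assumption: each sub-interval keeps its own table, so the shared
    midpoint is counted twice (a pessimistic simplification).
    """
    whole = entries_needed(f, lo, hi, max_err)
    if depth == 0:
        return whole
    mid = (lo + hi) / 2
    halves = (split_entries(f, lo, mid, max_err, depth - 1)
              + split_entries(f, mid, hi, max_err, depth - 1))
    return min(whole, halves)

# Usage: compare an even grid against split sub-intervals for log(x) on [1, 9].
even = entries_needed(math.log, 1.0, 9.0, 1e-2)
split = split_entries(math.log, 1.0, 9.0, 1e-2, depth=2)
print(even, split)
```

For log(x), the curve flattens toward the right end of the interval, so the right sub-interval gets by with far fewer breakpoints than an even grid over the whole interval would assign to it.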

preprint · 2022 · arXiv

Multi-Objective Design Space Exploration for the Optimization of the HEVC Mode Decision Process

Finding the best possible encoding decisions for compressing a video sequence is a highly complex problem. In this work, we propose a multi-objective Design Space Exploration (DSE) method to automatically find HEVC encoder implementations that are optimized for several different criteria. The DSE optimizes the coding mode evaluation order of the mode decision process and jointly explores early skip conditions to minimize the four objectives a) bitrate, b) distortion, c) encoding time, and d) decoding energy. In this context, we use a SystemC-based actor model of the HM test model encoder for the evaluation of each explored solution. The evaluation, which is based on real measurements, shows that our framework can automatically generate encoder solutions that save more than 60% of encoding time or 3% of decoding energy when accepting bitrate increases of around 3%.
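
A multi-objective DSE of this kind typically keeps only non-dominated solutions. The sketch below shows a plain Pareto filter over the four minimized objectives named in the abstract; the helper names and the sample tuples are hypothetical and do not come from the paper.

```python
def dominates(a, b):
    """True if a is no worse than b in every objective and strictly better in one."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))

def pareto_front(points):
    """Keep the points that no other point dominates (all objectives minimized)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

# Hypothetical (bitrate, distortion, encoding time, decoding energy) tuples.
candidates = [
    (1.00, 1.0, 10.0, 5.0),
    (1.03, 1.0, 4.0, 5.0),
    (1.05, 1.1, 9.0, 5.5),
    (1.03, 1.0, 4.0, 4.8),
]
front = pareto_front(candidates)
print(front)
```

The quadratic scan is adequate for a handful of candidates; an actual exploration over many encoder configurations would likely rely on an evolutionary optimizer with an archive rather than this brute-force filter.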

preprint · 2022 · arXiv

Raw Filtering of JSON Data on FPGAs

Many Big Data applications include the processing of data streams in semi-structured data formats such as JSON. A disadvantage of such formats is that an application may spend a significant amount of processing time just on unselectively parsing all data. To mitigate this issue, the concept of raw filtering is proposed with the idea to remove data from a stream prior to the costly parsing stage. However, as accurate filtering of raw data is often only possible after the data has been parsed, raw filters are designed to be approximate in the sense of allowing false positives in order to be implemented efficiently. Contrary to previously proposed CPU-based raw filtering techniques that are restricted to string matching, we present FPGA-based primitives for filtering strings, numbers and also number ranges. In addition, a primitive respecting the basic structure of JSON data is proposed that can be used to further increase the accuracy of the introduced raw filters. The proposed raw filter primitives are designed to allow for their composition according to a given filter expression of a query. Thus, complex raw filters can be created for FPGAs which enable a drastic decrease in the number of generated false positives, particularly for IoT workloads. As there exists a trade-off between accuracy and resource consumption, we evaluate primitives as well as composed raw filters using different queries from the RiotBench benchmark. Our results show that up to 94.3% of the raw data can be filtered without producing any observed false positives using only a few hundred LUTs.
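
The raw-filtering concept can be mimicked in software with a minimal sketch: records are tested on their unparsed bytes, and only survivors reach the parser. This is a CPU-side illustration, not the FPGA primitives from the paper; the query (`temperature` between 30 and 40), the field names, and the function names are assumptions, and the number primitive assumes plain decimal notation without exponents.

```python
import json
import re

NUMBER = re.compile(rb"-?\d+(?:\.\d+)?")   # assumption: no exponent notation

def raw_string_filter(raw: bytes, needle: bytes) -> bool:
    """String primitive: does the needle occur anywhere in the raw record?"""
    return needle in raw

def raw_range_filter(raw: bytes, lo: float, hi: float) -> bool:
    """Number-range primitive: does any numeric literal fall into [lo, hi]?"""
    return any(lo <= float(m.group().decode("ascii")) <= hi
               for m in NUMBER.finditer(raw))

def raw_filter(raw: bytes) -> bool:
    """Composed raw filter for the hypothetical query 30 <= temperature <= 40."""
    return raw_string_filter(raw, b'"temperature"') and raw_range_filter(raw, 30, 40)

def exact_filter(raw: bytes) -> bool:
    """Reference filter that pays for a full parse."""
    record = json.loads(raw)
    return 30 <= record.get("temperature", float("-inf")) <= 40

records = [
    b'{"id": 1, "temperature": 35.5}',
    b'{"id": 2, "temperature": 12.0}',
    b'{"id": 35, "temperature": 12.0}',   # false positive: 35 is the id
    b'{"id": 4, "humidity": 33}',
]
survivors = [r for r in records if raw_filter(r)]
matches = [r for r in survivors if exact_filter(r)]
print(len(survivors), len(matches))
```

The third record passes the raw filter although it does not satisfy the query, because the range primitive ignores which key a number belongs to. This is the kind of false positive that a structure-aware primitive, as described in the abstract, is meant to reduce.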

preprint · 2022 · arXiv

Real-Time Waveform Matching with a Digitizer at 10 GS/s

Side-Channel Analysis (SCA) requires the detection of the specific time frame in which Cryptographic Operations (COs) take place in the side-channel signal. In laboratory conditions with full control over the Device under Test (DuT), dedicated trigger signals can be implemented to indicate the start and end of COs. For real-world scenarios, waveform-matching techniques have been established which compare the side-channel signal with a template of the CO's pattern in real time to detect the CO in the side channel. State-of-the-art approaches are implemented on Field-Programmable Gate Arrays (FPGAs). However, current waveform-matching designs process the samples from Analog-to-Digital Converters (ADCs) sequentially and can only work with low sampling rates due to the limited clock speed of FPGAs. This makes it increasingly difficult to apply existing techniques to modern DuTs that operate with clock speeds in the GHz range. In this paper, we present a parallel waveform-matching architecture that is capable of performing waveform matching at the speed of fast ADCs. We implement the proposed architecture in a high-end FPGA-based digitizer and apply it to detect AES COs from the side channel of a single-board computer operating at 1 GHz. Our implementation allows for waveform matching at 10 GS/s with high accuracy, thus offering a speedup of 50x compared to the fastest state-of-the-art implementation known to us.
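
Waveform matching against a template can be sketched in a few lines of Python. The abstract does not name the similarity metric, so the sum of absolute differences (SAD) used here is an assumption, as are the function name, the threshold, and the sample values. The sketch is sequential and does not model the paper's parallel architecture.

```python
def match_positions(signal, template, threshold):
    """Return every start index where the window's SAD to the template is <= threshold."""
    n = len(template)
    hits = []
    for start in range(len(signal) - n + 1):
        sad = sum(abs(signal[start + i] - template[i]) for i in range(n))
        if sad <= threshold:
            hits.append(start)
    return hits

# Hypothetical trace containing one exact and one slightly distorted occurrence.
trace = [0, 0, 1, 5, 9, 5, 1, 0, 0, 1, 5, 8, 5, 1, 0]
pattern = [1, 5, 9, 5, 1]
hits = match_positions(trace, pattern, threshold=2)
print(hits)
```

Each window is independent of the others, which is what makes the comparison amenable to evaluating several sample offsets per clock cycle in hardware.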

preprint · 2021 · arXiv

Symbolic Loop Compilation for Tightly Coupled Processor Arrays

Loop compilation for Tightly Coupled Processor Arrays (TCPAs), a class of massively parallel loop accelerators, entails solving NP-hard problems, yet depends on the loop bounds and number of available processing elements (PEs), parameters known only at runtime because of dynamic resource management and input sizes. Therefore, this article proposes a two-phase approach called symbolic loop compilation: At compile time, the necessary NP-complete problems are solved and the solutions compiled into a space-efficient symbolic configuration. At runtime, a concrete configuration is generated from the symbolic configuration according to the parameter values. We show that the latter phase, called instantiation, runs in polynomial time with its most complex step, program instantiation, not depending on the number of PEs. As validation, we performed symbolic loop compilation on real-world loops and measured time and space requirements. Our experiments confirm that a symbolic configuration is space-efficient and suited for systems with little memory -- often, a symbolic configuration is smaller than a single concrete configuration -- and that program instantiation scales well with the number of PEs -- for example, when instantiating a symbolic configuration of a matrix-matrix multiplication, the execution time is similar for $4\times 4$ and $32\times 32$ PEs.

preprint · 2020 · arXiv

AnyHLS: High-Level Synthesis with Partial Evaluation

FPGAs excel in low-power and high-throughput computations, but they are challenging to program. Traditionally, developers rely on hardware description languages like Verilog or VHDL to specify the hardware behavior at the register-transfer level. High-Level Synthesis (HLS) raises the level of abstraction, but still requires FPGA design knowledge. Programmers usually write pragma-annotated C/C++ programs to define the hardware architecture of an application. However, each hardware vendor extends its own C dialect using its own vendor-specific set of pragmas. This prevents portability across different vendors. Furthermore, pragmas are not first-class citizens in the language. This makes it hard to use them in a modular way or design proper abstractions. In this paper, we present AnyHLS, an approach to synthesize FPGA designs in a modular and abstract way. AnyHLS is able to raise the abstraction level of existing HLS tools by resorting to programming language features such as types and higher-order functions as follows: It relies on partial evaluation to specialize and to optimize the user application based on a library of abstractions. Then, vendor-specific HLS code is generated for Intel and Xilinx FPGAs. Portability is obtained by avoiding any vendor-specific pragmas in the source code. In order to validate achievable gains in productivity, a library for the domain of image processing is introduced as a case study, and its synthesis results are compared with several state-of-the-art Domain-Specific Language (DSL) approaches for this domain.

preprint · 2020 · arXiv

HipaccVX: Wedding of OpenVX and DSL-based Code Generation

Writing programs for heterogeneous platforms optimized for high performance is hard since this requires the code to be tuned at a low level with architecture-specific optimizations that are often based on fundamentally differing programming paradigms and languages. OpenVX promises to solve this issue for computer vision applications with a royalty-free industry standard that is based on a graph-execution model. Yet, OpenVX's algorithm space is constrained to a small set of vision functions. This hinders accelerating computations that are not included in the standard. In this paper, we analyze OpenVX vision functions to find an orthogonal set of computational abstractions. Based on these abstractions, we couple an existing Domain-Specific Language (DSL) back end to the OpenVX environment and provide language constructs to the programmer for the definition of user-defined nodes. In this way, we enable optimizations that are not possible to detect with OpenVX graph implementations using the standard computer vision functions. These optimizations can double the throughput on an Nvidia GTX GPU and decrease the resource usage of a Xilinx Zynq FPGA by 50% for our benchmarks. Finally, we show that our proposed compiler framework, called HipaccVX, can achieve better results than the state-of-the-art approaches Nvidia VisionWorks and Halide-HLS.

preprint · 2020 · arXiv

Isolation-Aware Timing Analysis and Design Space Exploration for Predictable and Composable Many-Core Systems

Composable many-core systems enable the independent development and analysis of applications which will be executed on a shared platform where the mix of concurrently executed applications may change dynamically at run time. For each individual application, an off-line Design Space Exploration (DSE) is performed to compute several mapping alternatives on the platform, offering Pareto-optimal trade-offs in terms of real-time guarantees, resource usage, etc. At run time, one mapping is then chosen to launch the application on demand. In this context, to enable an independent analysis of each individual application at design time, so-called inter-application isolation schemes are applied which specify temporal or spatial isolation policies between applications. State-of-the-art composable many-core systems are developed based on a fixed isolation scheme that is exclusively applied to every resource in every mapping of every application and use a timing analysis tailored to that isolation scheme to derive timing guarantees for each mapping. A fixed isolation scheme, however, heavily restricts the explored space of solutions and can, therefore, lead to suboptimality. Lifting this restriction necessitates a timing analysis that is applicable to mappings with an arbitrary mix of isolation schemes on different resources. To address this issue, we present an isolation-aware timing analysis that, unlike existing analyses, can handle multiple isolation schemes in combination within one mapping and delivers safe yet tight timing bounds by identifying and excluding interference scenarios that can never happen under the given combination of isolation schemes. Based on the timing analysis, we present a DSE which explores the choices of isolation scheme per resource within each mapping. Experimental results demonstrate the advantage of the proposed approach over approaches based on a fixed isolation scheme.

preprint · 2020 · arXiv

Secure Boot from Non-Volatile Memory for Programmable SoC Architectures

In modern embedded systems, the trust in comprehensive security standards all along the product life cycle has become an increasingly important access-to-market requirement. However, these security standards rely on mandatory immunity assumptions such as the integrity and authenticity of an initial system configuration typically loaded from Non-Volatile Memory (NVM). This applies especially to FPGA-based Programmable System-on-Chip (PSoC) architectures, since object codes as well as configuration data easily exceed the capacity of a secure bootROM. In this context, an attacker could try to alter the content of the NVM device in order to manipulate the system. The PSoC therefore relies on the integrity of the NVM particularly at boot-time. In this paper, we propose a methodology for securely booting from an NVM in a potentially insecure environment by exploiting the reconfigurable logic of the FPGA. Here, the FPGA serves as a secure anchor point by performing required integrity and authenticity verifications prior to the configuration and execution of any user application loaded from the NVM on the PSoC. The proposed secure boot process is based on the following assumptions and steps: 1) The boot configuration is stored on a fully encrypted Secure Digital memory card (SD card) or alternatively Flash acting as NVM. 2) At boot time, a hardware design called Trusted Memory-Interface Unit (TMIU) is loaded to verify first the authenticity of the deployed NVM and then, after decryption, the integrity of its content. To demonstrate the practicability of our approach, we integrated the methodology into the vendor-specific secure boot process of a Xilinx Zynq PSoC and evaluated the design with respect to the objectives of performance, power, and resource cost.