Researcher profile

Mehdi Alipour

Mehdi Alipour contributes to research discovery and scholarly infrastructure.

ResearcherAffiliation not importedOpen to collaborate

Trust snapshot

Quick read

Trust 17 - UnverifiedVerification L1Unclaimed author
4works
0followers
2topics
4close collaborators

Actions

Decide how to stay connected

Follow researcher0

Identity and collaboration

How to connect with this researcher

Claiming links this public author record to a researcher profile and unlocks direct collaboration workflows.

Log in to claim

Direct collaboration

Open a focused conversation when the fit is right

Claim this author entity first to unlock direct invitations.

Research graph

See the researcher in context

Open full explorer

Inspect adjacent work, topics, institutions and collaborators without jumping out to a separate graph page.

Building this graph slice

BZPEER is loading the nearby papers, people, topics and institutions for this page.

Published work

4 published item(s)

preprint2022arXiv

Freeway to Memory Level Parallelism in Slice-Out-of-Order Cores

Exploiting memory level parallelism (MLP) is crucial to hide long memory and last level cache access latencies. While out-of-order (OoO) cores, and techniques building on them, are effective at exploiting MLP, they deliver poor energy efficiency due to their complex and energy-hungry hardware. This work revisits slice-out-of-order (sOoO) cores as an energy efficient alternative for MLP exploitation. sOoO cores achieve energy efficiency by constructing and executing \textit{slices} of MLP generating instructions out-of-order only with respect to the rest of instructions; the slices and the remaining instructions, by themselves, execute in-order. However, we observe that existing sOoO cores miss significant MLP opportunities due to their dependence-oblivious in-order slice execution, which causes dependent slices to frequently block MLP generation. To boost MLP generation, we introduce Freeway, a sOoO core based on a new dependence-aware slice execution policy that tracks dependent slices and keeps them from blocking subsequent independent slices and MLP extraction. The proposed core incurs minimal area and power overheads, yet approaches the MLP benefits of fully OoO cores. Our evaluation shows that Freeway delivers 12% better performance than the state-of-the-art sOoO core and is within 7% of the MLP limits of full OoO execution.

preprint2012arXiv

Design Space Exploration to Find the Optimum Cache and Register File Size for Embedded Applications

In the future, embedded processors must process more computation-intensive network applications and internet traffic and packet-processing tasks become heavier and sophisticated. Since the processor performance is severely related to the average memory access delay and also the number of processor registers affects the performance, cache and register file are two major parts in designing embedded processor architecture. Although increasing cache and register file size leads to performance improvement in embedded applications and packet-processing tasks in high traffic networks with too much packets, the increased area, power consumption and memory hierarchy delay are the overheads of these techniques. Therefore, implementing these components in the optimum size is of significant interest in the design of embedded processors. This paper explores the effect of cache and register file size on the processor performance to calculate the optimum size of these components for embedded applications. Experimental results show that although having bigger cache and register file is one of the performance improvement approaches in embedded processors, however, by increasing the size of these parameters over a threshold level, performance improvement is saturated and then, decreased.

preprint2012arXiv

Effect of Thread Level Parallelism on the Performance of Optimum Architecture for Embedded Applications

According to the increasing complexity of network application and internet traffic, network processor as a subset of embedded processors have to process more computation intensive tasks. By scaling down the feature size and emersion of chip multiprocessors (CMP) that are usually multi-thread processors, the performance requirements are somehow guaranteed. As multithread processors are the heir of uni-thread processors and there isn't any general design flow to design a multithread embedded processor, in this paper we perform a comprehensive design space exploration for an optimum uni-thread embedded processor based on the limited area and power budgets. Finally we run multiple threads on this architecture to find out the maximum thread level parallelism (TLP) based on performance per power and area optimum uni-thread architecture.

preprint2012arXiv

Performance-Optimum Superscalar Architecture for Embedded Applications

Embedded applications are widely used in portable devices such as wireless phones, personal digital assistants, laptops, etc. High throughput and real time requirements are especially important in such data-intensive tasks. Therefore, architectures that provide the required performance are the most desirable. On the other hand, processor performance is severely related to the average memory access delay, number of processor registers and also size of the instruction window and superscalar parameters. Therefore, cache, register file and superscalar parameters are the major architectural concerns in designing a superscalar architecture for embedded processors. Although increasing cache and register file size leads to performance improvements in high performance embedded processors, the increased area, power consumption and memory delay are the overheads of these techniques. This paper explores the effect of cache, register file and superscalar parameters on the processor performance to specify the optimum size of these parameters for embedded applications. Experimental results show that although having bigger size of these parameters is one of the performance improvement approaches in embedded processors, however, by increasing the size of some parameters over a threshold value, performance improvement is saturated and especially in cache size, increments over this threshold value decrease the performance.