Source author record

Bulent Abali

Bulent Abali appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

ResearcherUnclaimed source record

Distributed, Parallel, and Cluster Computing Hardware Architecture Performance Artificial Intelligence Databases eess.SY Machine Learning Systems and Control

Catalog footprint

What is connected

3works

8topics

4close collaborators

Actions

Connect this record

Open graph Browse works

Inspect adjacent papers, topics, institutions and collaborators without losing the researcher page.

Building this map preview

BZPEER is loading the nearby papers, people, topics and institutions for this page.

preprint2022arXiv

EFloat: Entropy-coded Floating Point Format for Compressing Vector Embedding Models

In a large class of deep learning models, including vector embedding models such as word and database embeddings, we observe that floating point exponent values cluster around a few unique values, permitting entropy based data compression. Entropy coding compresses fixed-length values with variable-length codes, encoding most probable values with fewer bits. We propose the EFloat compressed floating point number format that uses a variable field boundary between the exponent and significand fields. EFloat uses entropy coding on exponent values and signs to minimize the average width of the exponent and sign fields, while preserving the original FP32 exponent range unchanged. Saved bits become part of the significand field increasing the EFloat numeric precision by 4.3 bits on average compared to other reduced-precision floating point formats. EFloat makes 8-bit and even smaller floats practical without sacrificing the exponent range of a 32-bit floating point representation. We currently use the EFloat format for saving memory capacity and bandwidth consumption of large vector embedding models such as those used for database embeddings. Using the RMS error as metric, we demonstrate that EFloat provides higher accuracy than other floating point formats with equal bit budget. The EF12 format with 12-bit budget has less end-to-end application error than the 16-bit BFloat16. EF16 with 16-bit budget has an RMS-error 17 to 35 times less than BF16 RMS-error for a diverse set of embedding models. When making similarity and dissimilarity queries, using the NDCG ranking metric, EFloat matches the result quality of prior floating point representations with larger bit budgets.

preprint2019arXiv

Touché: Towards Ideal and Efficient Cache Compression By Mitigating Tag Area Overheads

Compression is seen as a simple technique to increase the effective cache capacity. Unfortunately, compression techniques either incur tag area overheads or restrict data placement to only include neighboring compressed cache blocks to mitigate tag area overheads. Ideally, we should be able to place arbitrary compressed cache blocks without any placement restrictions and tag area overheads. This paper proposes Touché, a framework that enables storing multiple arbitrary compressed cache blocks within a physical cacheline without any tag area overheads. The Touché framework consists of three components. The first component, called the ``Signature'' (SIGN) engine, creates shortened signatures from the tag addresses of compressed blocks. Due to this, the SIGN engine can store multiple signatures in each tag entry. On a cache access, the physical cacheline is accessed only if there is a signature match (which has a negligible probability of false positive). The second component, called the ``Tag Appended Data'' (TADA) mechanism, stores the full tag addresses with data. TADA enables Touché to detect false positive signature matches by ensuring that the actual tag address is available for comparison. The third component, called the ``Superblock Marker'' (SMARK) mechanism, uses a unique marker in the tag entry to indicate the occurrence of compressed cache blocks from neighboring physical addresses in the same cacheline. Touché is completely hardware-based and achieves an average speedup of 12\% (ideal 13\%) when compared to an uncompressed baseline.

preprint2015arXiv

Disaggregated and optically interconnected memory: when will it be cost effective?

The "Disaggregated Server" concept has been proposed for datacenters where the same type server resources are aggregated in their respective pools, for example a compute pool, memory pool, network pool, and a storage pool. Each server is constructed dynamically by allocating the right amount of resources from these pools according to the workload's requirements. Modularity, higher packaging and cooling efficiencies, and higher resource utilization are among the suggested benefits. With the emergence of very large datacenters, "clouds" containing tens of thousands of servers, datacenter efficiency has become an important topic. Few computer chip and systems vendors are working on and making frequent announcements on silicon photonics and disaggregated memory systems. In this paper we study the trade-off between cost and performance of building a disaggregated memory system where DRAM modules in the datacenter are pooled, for example in memory-only chassis and racks. The compute pool and the memory pool are interconnected by an optical interconnect to overcome the distance and bandwidth issues of electrical fabrics. We construct a simple cost model that includes the cost of latency, cost of bandwidth and the savings expected from a disaggregated memory system. We then identify the level at which a disaggregated memory system becomes cost competitive with a traditional direct attached memory system. Our analysis shows that a rack-scale disaggregated memory system will have a non-trivial performance penalty, and at the datacenter scale the penalty is impractically high, and the optical interconnect costs are at least a factor of 10 more expensive than where they should be when compared to the traditional direct attached memory systems.