Parallelizing the Chambolle Algorithm for Performance-Optimized Mapping on FPGA Devices

Beretta, I., Rana, V., Akin, A., Nacci, A. A., Sciuto, D. and Atienza, D. 2016. Parallelizing the Chambolle Algorithm for Performance-Optimized Mapping on FPGA Devices. ACM Transactions on Embedded Computing Systems (TECS). 15 (3), Article No. 44. https://doi.org/10.1145/2851497

Title	Parallelizing the Chambolle Algorithm for Performance-Optimized Mapping on FPGA Devices
Authors	Beretta, I., Rana, V., Akin, A., Nacci, A. A., Sciuto, D. and Atienza, D.
Abstract	The performance and the efficiency of recent computing platforms have been deeply influenced by the widespread adoption of hardware accelerators, such as Graphics Processing Units (GPUs) or Field Programmable Gate Arrays (FPGAs), which are often employed to support the tasks of General Purpose Processors (GPPs). One of the main advantages of these accelerators over their sequential counterparts (GPPs) is their ability to perform massively parallel computation. However, in order to exploit this competitive edge, it is necessary to extract the parallelism from the target algorithm to be executed, which is in general a very challenging task. This concept is demonstrated, for instance, by the poor performance achieved on relevant multimedia algorithms, such as Chambolle, which is a well-known algorithm employed for optical flow estimation. The implementations of this algorithm that can be found in the state of the art are generally based on GPUs, but barely improve on the performance that can be obtained with a powerful GPP. In this paper, we propose a novel approach to extract the parallelism from computation-intensive multimedia algorithms, which includes an analysis of their dependency schema and an assessment of their data reuse. We then perform a thorough analysis of the Chambolle algorithm, providing a formal proof of its inner data dependencies and locality properties. Then, we exploit the considerations drawn from this analysis by proposing an architectural template that takes advantage of the fine-grained parallelism of FPGA devices. Moreover, since the proposed template can be instantiated with different parameters, we also propose a design metric, the expansion rate, to help the designer estimate the efficiency and performance of the different instances, making it possible to select the right one before the implementation phase. We finally show, by means of experimental results, how the proposed analysis and parallelization approach leads to the design of efficient and high-performance FPGA-based implementations that are orders of magnitude faster than the state-of-the-art ones.
Keywords	Optical flow
	TV-L1 Algorithm
	FPGA
	Parallel Architectures
	Custom Hardware
Article number	44
Journal	ACM Transactions on Embedded Computing Systems (TECS)
Journal citation	15 (3), Article No. 44
ISSN	1539-9087
	1558-3465
Year	2016
Publisher	ACM
Accepted author manuscript	chambolle-1.pdf
Digital Object Identifier (DOI)	https://doi.org/10.1145/2851497
Publication dates
Published	21 Jul 2016
