# Framework (OC-Accel), simulation engine (OCSE) and high level language (HLS)

University of Geneva, September 15th, 2021

IBM Montpellier

### **Presentation Outline**

- Application porting at a glance
- Coding without a framework
- Open-source framework architecture
- Ease of coding
- Ease of moving
- Ease of adapting
- FPGA acceleration: a 3 steps process

# FPGA development: no framework with HDL

HDL: Hardware Description Language

Develop your code:

- Software side: lib(o)cxl APIs
- FPGA side:
  - CAPI PSL interface
  - OpenCAPI TLx
  - Your action in HDL

# OC-ACCEL: OpenCAPI Acceleration Framework

- It is an open-source development environment (like SNAP was for CAPI1&2)
- Code is at https://github.com/OpenCAPI/oc-accel
- Doc is at https://opencapi.github.io/oc-accel-doc/
- POWER Utils tools are at https://github.com/OpenCAPI/oc-utils
- How to set up a project
- Easy to re-use CAPI1/2
- Easy to change card or set up a new one
- How to simulate a project (simple examples)
- How to generate the FPGA flash memory content
- How to test on Power

### Quick and easy development framework for OpenCAPI Accelerators

# **OC-ACCEL** documentation

Coherent user-level accelerators and I/O devices

Power™ Coherent Acceleration Processor Interface (CAPI)

Different examples are provided. Each directory has a **/sw** directory with the main calling application and a **/hw** directory with the action coded either in RTL or in C/C++.

We will briefly explore:

- The pixel manipulation example
- The python example

Predefined configuration, avoiding setup mistakes: « make snap\_config »

## Example of HLS usage

## hw/action\_pixel\_filter.cpp:

```
// (excerpt: the beginning of the function is not shown on the slide)
#pragma HLS INLINE
#pragma HLS stream depth=16 variable=in_stream
#pragma HLS PIPELINE
    pixel->red   = in_stream.read();
    pixel->green = in_stream.read();
    pixel->blue  = in_stream.read();
}
```

## hw/action\_pixel\_filter.cpp

```
static void
strmInWrite(hls::stream<unsigned char> &in_stream, snap_membus_512_t *din_gmem,
            action_reg *act_reg, uint64_t idx, uint32_t nbPixel)
{
    unsigned char elt[BPERDW_512];
    uint32_t nb, done;
    int i;

#pragma HLS INLINE // dataflow
    // Number of 512-bit bus words to read from host memory
    nb = act_reg->Data.in.size / BPERDW_512;

L1:
    //#pragma HLS PIPELINE
    for (int j = 0; j < nb; j++) {
        // Read one bus word from host memory into the byte array
        rBurstOfDataMem(din_gmem, (snapu64_t)idx, elt);
L11:
        for (i = 0; i < BPERDW_512; i++) {
#pragma HLS UNROLL factor=64
            done = j * BPERDW_512 + i;
            // Only push the bytes that belong to the image
            if (done < nbPixel)
                in_stream.write(elt[i]);
        }
        idx++;
    }
}
```

This is how we prepare the hardware using Vivado HLS. Two in/out streams collect the data from, and return it to, the host memory.

The pixel manipulation is described in C/C++.

## hw/action\_pixel\_filter.cpp

```
static void grayscale(pixel_t *pixel_in, pixel_t *pixel_out) {
    uint8_t gray = (((pixel_in->red)   * RED_FACTOR)   >> 8)
                 + (((pixel_in->green) * GREEN_FACTOR) >> 8)
                 + (((pixel_in->blue)  * BLUE_FACTOR)  >> 8);
    pixel_out->red   = gray;
    pixel_out->green = gray;
    pixel_out->blue  = gray;
    return;
}

static void transformPixel(pixel_t *pixel_in_add, pixel_t *pixel_out_add) {
    // Pixels where red is not the dominant channel are converted to gray
    if (pixel_in_add->red < pixel_in_add->green || pixel_in_add->red < pixel_in_add->blue) {
        grayscale(pixel_in_add, pixel_out_add);
        return;
    } else {
        pixel_out_add->red   = pixel_in_add->red;
        pixel_out_add->blue  = pixel_in_add->blue;
        pixel_out_add->green = pixel_in_add->green;
        return;
    }
}
```

```
castella@hdclf149:.../framework/castella2/oc-image_sim$ make sim
Path to vivado          is set to: /afs/bb/proj/fpga/xilinx/Vivado/2019.2/bin/vivado
Vivado version          is set to: Vivado v2019.2 (64-bit)
=====Simulation setup: Setting up OCSE version=========
=====Simulation setup: Checking path to OCSE========
OCSE_ROOT               is set to: "/afs/bb/proj/fpga/framework/castella2/ocse"
=====ACTION ROOT setup=====================
ACTION_ROOT             is set to: "/afs/vlsilab.boeblingen.ibm.com/proj/fpga/framework/c
ls_image_filter"
=====Timing limit for FPGA image build in ps=========
TIMING_LABLIMIT         is set to: "-200"
=====Content of snap_env.sh=====
export TIMING_LABLIMIT="-200"
```
```
export ACTION_ROOT=${SNAP_ROOT}/actions/hls_image_filter
export OCSE_ROOT=/afs/bb/proj/fpga/framework/castella2/ocse
== Precompiling the Action logic: hls_image_filter
[HW PROJECT.....] start 21:24:25 Wed Sep 09 2020
[CONFIG ACTION HW....] start 21:24:25 Wed
Compiling action with Vivado HLS
Clock period used for HLS is 4 r
Checking for critical warnings d
Checking for critical timings du
Checking for reserved MMIO area
[CONFIG ACTION HW....] done 21:25:
```

#### Run a simulation

« make sim »

In 5 minutes you can simulate WITH the Host server and the actual memory.

```
INFO: [Common 17-1239] XILINX_LOCAL_USER_DATA is set to 'no'
export simulation for version=2019.2
patch simulation
link to libdpi
build xsim model
[BUILD xsim MODEL....] done 21:28:59 Wed Sep 09 2020
Suggested next step: to run a simulation, execute: make sim
[SIMULATION......] start 21:28:59 Wed Sep 09 2020
SIMULATOR is set to xsim
SNAP_ROOT=/afs/vlsilab.boeblingen.ibm.com/proj/fpga/framework/castella2/oc-image_sim
simulator=xsim simdir=xsim simtop=top capi_ver=opencapi30
in sim script subdirectory /afs/vlsilab.boeblingen.ibm.com/proj/fpga/framework/castella2/oc-image_sim
prepare simout directory from pwd=/afs/vlsilab.boeblingen.ibm.com/proj/fpga/framework/castella2/oc-in
copy default ocse parms
copy parms file CAPI_VER=opencapi30 parm file=
CAPI_VER=opencapi30 OCSE_ROOT=/afs/bb/proj/fpga/framework/castella2/ocse
SNAP_root= /afs/vlsilab.boeblingen.ibm.com/proj/fpga/framework/castella2/oc-image_sim
simbase= /afs/vlsilab.boeblingen.ibm.com/proj/fpga/framework/castella2/oc-image_sim/hardware/sim
```

#### Testcase window (use « script stim.log » to log input)

```
castella@hdclf149:.../sim/xsim/20200909_212859$ snap_image_filter -i ../../../actions/hls_image_filter/sw/tiger_small_bmp -o tiger_small_sim.bmp
input ../../../../actions/hls_image_filter/sw/tiger_small_bmp
output tiger_small_sim.bmp
Bitmap size: 5070
INFO:Connecting to host 'hdclf149.boeblingen.de.ibm.com' port 16384
elaps time 41706123 micro seconds.
INFO:detach response from from ocse
castella@hdclf149:.../sim/xsim/20200909_212859$ ll tiger_small_sim.bmp
-rw------ 1 castella gloadl 5070 Sep 9 21:33 tiger_small_sim.bmp
```

## Hardware exchanges & computation analysis

Once the simulation has been run, you can, if required, check and debug the exact transmissions with the « ./display\_traces » command.

## Card programming

Once simulation and chronograms (waveforms) are satisfactory, it is time to generate an image with the « make image » command. This will actually prepare the synthesis of the circuitry.
It takes some time and provides a binary file (in \$SNAP\_ROOT/hardware/build/Images/xxx.bin) ready to be stored in the flash memory of the FPGA card.

```
***** xsim v2019.2 (64-bit)
 **** SW Build 2708876 on Wed Nov 6 21:39:14 MST 2019
 **** IP Build 2700528 on Thu Nov 7 00:09:20 MST 2019
   ** Copyright 1986-2019 Xilinx, Inc. All Rights Reserved.
start_gui
```

The flashing tool comes from https://github.com/OpenCAPI/oc-utils

```
castella@orpington:/home/capiteam/Images/AD9V3_OC/image_filter$ sudo oc-flash-script oc_2020_0909_1728_25G_hls_image_filter_noSDRAM_OC-AD9V3_-72_primary.bin oc_2020_0909_1728_25G_hls_image_filter_noSDRAM_OC-AD9V3_-72_secondary.bin
== OpenCAPI programming tool ==
oc-flash script version is 2.3
Tool compiled on: Jun 18 14:46
In this server: 1 OpenCAPI card(s) found.
Current date is Wed 09 Sep 2020 08:13:59 PM CEST
Logs shows that last programming was:
Flashed Card Last Image
card0:0006:00:00.0 Alphadata9V3(VU3P) Tue 08 Sep 2020 03:57:35 PM CEST mesnet ./Images/AD9V3_OC/oc_2020_0908_1352_25G_hls_memcopy_512_SDRAM_OC-AD9V3_-35_primary.bin ./Images/AD9V3_OC/oc_2020_0908_1352_25G_hls_memcopy_512_SDRAM_OC-AD9V3_-35_secondary.bin
Which card do you want to flash? [0-0] 0
REMINDER: It is safer to CLOSE all JTAG tools (SDK, hardware_manager) before starting programming.
You will flash card0 with:
oc_2020_0909_1728_25G_hls_image_filter_noSDRAM_OC-AD9V3_-72_primary.bin
and oc_2020_0909_1728_25G_hls_image_filter_noSDRAM_OC-AD9V3_-72_secondary.bin
Do you want to continue? [y/n] y
Using spi x8 mode
Primary bitstream: oc_2020_0909_1728_25G_hls_image_filter_noSDRAM_OC-AD9V3_-72_primary.bin !
QSPI master core setup: completed
```

```
castella@orpington:~/oc-accel-image$ sudo ~/oc-accel/software/tools/oc_find_card -v -AALL
[sudo] password for castella:
oc_find_card version is 2.4
AD9V3 card has been detected in CAPI card position: 0
PSL Revision is
Device ID is : 0x0632
Sub device is : 0x060f
Image loaded is self defined as : user
Next image to be loaded at next reset (load_image_on_perst) is : user
Hardware Card PCI location is : 0030:01:00.0
Virtual Card PCI location is : 0008:00:00.0
Card PCI physical slot is (requires sudo priv) : SLOT0
OC-AD9V3 card has been detected in OPENCAPI card position: 0
Device ID : 0x062b
Sub device is : 0x060f
Image loaded is self defined as : factory
Virtual Card PCI location is : 0006:00:00.1
Card PCI physical slot is : Not Applicable
Total 2 cards detected
```

```
castella@orpington:~/oc-accel-image$ ./actions/hls_image_filter/sw/snap_image_filter -i ./actions/hls_image_filter/sw/tiger.bmp -o ./actions/hls_image_filter/sw/tiger_out.bmp
input ./actions/hls_image_filter/sw/tiger.bmp
output ./actions/hls_image_filter/sw/tiger_out.bmp
Bitmap size: 873234
elaps time 24023 micro seconds.
```

## **Bandwidth testing**

- Each hls\_\*memcopy\_\* action offers a simple performance test case to run on your P9 hardware
- Highlighted, we see 17.7 GB/s from host memory to FPGA and more than 20 GB/s going from FPGA to host memory

OC-Accel hls\_memcopy\_1024 Throughput (MBytes/s). LCL stands for DDR or HBM memory according to hardware.

| bytes | Host->FPGA_RAM | FPGA_RAM->Host | FPGA(LCL->BRAM) | FPGA(BRAM->LCL) |
|-----------|-----------|-----------|-----------|-----------|
| 512 | 8.828 | 10.240 | 10.240 | 11.907 |
| 1024 | 23.814 | 20.480 | 1.484 | 1.476 |
| 2048 | 3.225 | 2.926 | 2.985 | 2.985 |
| 4096 | 5.971 | 6.554 | 6.491 | 80.314 |
| 8192 | 11.924 | 6.192 | 6.141 | 6.466 |
| 16384 | 12.337 | 12.911 | 12.921 | 12.870 |
| 32768 | 24.768 | 24.693 | 24.787 | 25.863 |
| 65536 | 49.461 | 95.118 | 92.959 | 102.721 |
| 131072 | 204.800 | 188.052 | 188.322 | 97.815 |
| 262144 | 195.484 | 203.055 | 195.193 | 202.741 |
| 524288 | 404.856 | 399.305 | 380.194 | 383.251 |
| 1048576 | 759.838 | 1351.258 | 775.574 | 741.567 |
| 2097152 | 1457.368 | 1408.430 | 1402.777 | 1391.607 |
| 4194304 | 2720.042 | 4185.932 | 4096.000 | 4096.000 |
| 8388608 | 7483.147 | 6732.430 | 6091.945 | 6061.133 |
| 16777216 | 7584.637 | 10292.771 | 6193.140 | 6181.730 |
| 33554432 | 10525.230 | 13584.790 | 9683.819 | 9703.422 |
| 67108864 | 13899.930 | 16615.218 | 10789.206 | 10764.977 |
| 134217728 | 17563.168 | 16927.447 | 11443.237 | 11411.131 |
| 268435456 | **17688.156** | **20650.470** | 11786.409 | 11749.265 |

#### Note:

- Make sure the OpenCAPI link is attached to the core where the software is executed.
- Use numactl to control this

## The python example

- Uses SWIG, CURL and pip3 to keep the environment controlled
- The FPGA contains the hello\_world\_1024 binary (the Helloworld HLS (C/C++) description is reused)
- Host memory is accessed by the Python code, which in turn exchanges data with the hardware through the OpenCAPI interface
- Can run in a Jupyter notebook

https://github.com/OpenCAPI/oc-accel/tree/master/actions/hls\_helloworld\_python

# The CAPI SNAP/OC-Accel concept

# 2 different working modes

- The Job-Queue Mode (SERIAL MODE): the FPGA action executes a job and returns after completion
- The Fixed-Action Mode (PARALLEL MODE): the FPGA action is designed to run permanently; data-streaming approach with data-in and

### **Presentation Outline**

- Application porting at a glance
- Coding without a framework
- Open-source framework architecture
- Ease of coding
- Ease of moving
- Ease of adapting
- FPGA acceleration: a 3 steps process

1. EXAMPLE: SNAP\_CONFIG=**CPU** snap\_helloworld -i /tmp/t1 -o /tmp/t2 (x86 server; command: make snap\_config)
2. SNAP\_CONFIG=**FPGA** snap\_helloworld -i /tmp/t1 -o /tmp/t2 (command: make sim)
3. EXECUTION: SNAP\_CONFIG=**FPGA** snap\_helloworld -i /tmp/t1 -o /tmp/t2 (command: make image)

- CAPI / OpenCAPI removes the driver latency that a classic "FPGA + drivers" setup adds
- **HLS** can easily be tuned to reach performance as good as a low-level language
- SNAP / OC-ACCEL follow the evolution of CAPI / OpenCAPI and of FPGAs without a change in the user's code
- Open source helps integration with other software (libfuse...) and motivates new IPs/projects based on SNAP and CAPI/OpenCAPI
- Complex C/C++ codes (3000 lines) can be used for FPGA programming
- CAPI / OpenCAPI Simulation Engines save a huge amount of debugging time

- Know more about accelerators?
- See a live demonstration?
- Do a benchmark?
- Get answers to your questions?
# Contact us

- alexandre.castellane@fr.ibm.com
- bruno.mesnet@fr.ibm.com
- fabrice\_moyen@fr.ibm.com

OpenCAPI Consortium: https://www.opencapi.org

OpenCAPI Repository: https://github.com/OpenCAPI

OC-Accel Documentation: https://opencapi.github.io/oc-accel-doc/
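
The pixel manipulation shown in hw/action\_pixel\_filter.cpp can also be tried in plain C++ on a CPU before any simulation. The sketch below is not taken from the OC-Accel repository. The `pixel_t` layout and the three weight values are assumptions made for illustration, since the slides do not show them, and `selfCheck` is a hypothetical helper. Only the arithmetic and the red-dominance test follow the slide.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical weights (not given in the slides). They sum to 256 so that
// the ">> 8" shift brings the weighted sum back into the 0..255 range.
constexpr uint32_t RED_FACTOR = 77;
constexpr uint32_t GREEN_FACTOR = 150;
constexpr uint32_t BLUE_FACTOR = 29;
static_assert(RED_FACTOR + GREEN_FACTOR + BLUE_FACTOR == 256,
              "weights must sum to 256");

// Assumed pixel layout: one byte per colour channel.
struct pixel_t {
    uint8_t red;
    uint8_t green;
    uint8_t blue;
};

// Same arithmetic as the slide: each channel is weighted and shifted
// separately, then the three contributions are added.
inline pixel_t grayscale(const pixel_t &in) {
    const uint8_t gray = static_cast<uint8_t>(
        ((in.red * RED_FACTOR) >> 8) +
        ((in.green * GREEN_FACTOR) >> 8) +
        ((in.blue * BLUE_FACTOR) >> 8));
    return pixel_t{gray, gray, gray};
}

// Keep the pixel when red is at least as large as green and blue,
// otherwise replace it with its gray value.
inline pixel_t transformPixel(const pixel_t &in) {
    if (in.red < in.green || in.red < in.blue) {
        return grayscale(in);
    }
    return in;
}

// Small usage example: one red-dominant pixel and one green-dominant pixel.
inline void selfCheck() {
    const pixel_t kept = transformPixel(pixel_t{200, 100, 50});
    assert(kept.red == 200 && kept.green == 100 && kept.blue == 50);

    const pixel_t grayed = transformPixel(pixel_t{10, 200, 30});
    assert(grayed.red == grayed.green && grayed.green == grayed.blue);
}
```

Passing pixels by value instead of through pointers keeps this CPU version easy to test. The HLS action works on streams and pointers, so this sketch only mirrors the per-pixel logic, not the hardware interface.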