Zero Delay Loop
From Open64 Wiki
Contents |
Abstract
This ZDL contribution is a co-operational approach of global scalar optimizer WOPT and the Code Generation component CG. The common ZDL features, i.e, the loop counting mechanism hiding and bottom loop condition abstracted are implemented in WOPT. The changes in WOPT are minimized while keeping the maximum optimizing result. A pseudo branch op method is also introduced to expand ZDL WHIRL operator and delay the ZDL hardware instruction generation to the very last phase of CG loop optimisation. This lazy implementation helps keeping the compatibility with other CG loop optimizations to produce the most optimal code. It is believed with the new ZDL implementation, open64 provides the equal expressibility to gcc's doloop_begin, doloop_end and decrement_and_branch_until_zero patterns.
ZDL introduction
Zero-Delay-Loop(also called Zero-Overhead-Loop) is a commonly used DSP feature which efficiently implements the "do" loops. Before entering the loop segment, registers specifying the number of times(L) the loop, the beginning(START_PC) and end(END_PC) of the loop are set up. Then, without explicit branches embedded in the code, the loop body is executed the L times before proceeding to the next segment. Typically there is loop counter which is initialized to L and is decremented with each execution of the loop. The zero delay comes form the fact that the branch instruction and another instruction used to increment/decrement the counter are eliminated, resulting in fewer number of dynamic instructions and branch misses are reduced to "zero". A typical ZDL code can be shown below(SL target code as an example):
mvtc.i $lp0,16 # [2]
add.i $3,$3,%lo(s) # [3]
add.i $2,$2,1 # [4]
loop 0, .tag_0_2562 # [5]
.Lt_0_2050:
#<loop> Loop body line 2, nesting depth: 1, iterations: 16
.loc 1 5 0
# 3 int i;
# 4 for(i=0;i<16;i++){
# 5 s[i]=2*j+1;
stw16 $2,$3 # [0] id:13 s+0x0
.loc 1 4 0
add16.i $3,4 # [0]
.loc 1 6 0
# 6 j++;
.tag_0_2562:
add.i $2,$2,2 # [1]
where lp0 as the trip-count register(L), loop as the ZDL trigger instruction and tag 0 2562 as the ending PC.
ZDL BR
To do ZDL, we can transform the loop:
while(k3>=10){
sum+=k1;
k3 --;
}
into the form:
zdl_loop(k3-9) {
sum+=k1;
}
So, we introduce a new ZDLBR operator, which represents the loop in whirl as:
LABEL L2050 0 {line: 0}
LOOP_INFO 0 1 1
I4I4LDID 73 <1,2,.preg_I4> T<4,.predef_I4,4> # k3
I4I4LDID 77 <1,2,.preg_I4> T<4,.predef_I4,4> # <preg>
END_LOOP_INFO
I4I4LDID 74 <1,2,.preg_I4> T<4,.predef_I4,4> # k1
I4I4LDID 75 <1,2,.preg_I4> T<4,.predef_I4,4> # sum
I4ADD
I4STID 75 <1,2,.preg_I4> T<4,.predef_I4,4> # sum {line: 5}
ZDLBR L2050 {line: 0}
Deleting the Induction Variable k3 is done by WOPT for the fact of ZDL abstraction of trip-counting hidden, at the same time to relieve the burden of CG.
Semantics for ZDLBR
A specification for the OPR_ZDLBR is proposed as:
OPR_ZDLBR[M-VL]:
- A non-structured branch operation similar to OPR_FALSEBR/OPR_TRUEBR, which specifies a label to branch to conditionally, but the label must occur before this instruction and must contain the LOOP INFO structure. ZDLBR's branch condition is implicitly specified by the LOOP INFO associated with the label. Thus, the counting mechanism and branch condition are hidden and cannot be optimized. This operator can be used to indicate a zero-delay loop for targets that provide such looping instructions.
Lowering Specification:
- OPR_ZDLBR occurs at the M level WHIRL as a control flow operator, it is either produced by the global scalar optimizer WOPT or by lowering specific form of DO_LOOP operator. There are no special handling when OPR_ZDLBR is lowered to L and VL level WHIRL.
Design and Implementation
In wopt, a new ”ZDL TRANSFORMATION”phase is added to the very last of main opt phases. Since ZDL is quite hardware oriented stuff, it is designed to put in the main-opt. Let other optimizations go first to make sure the minimal changes and maximum optimized code.
Kernel of the phase is quite simple:
BB_NODE *do_end= bb_loop->End(); BB_NODE *do_body = bb_loop->Body(); do_end->Remove_stmtrep(do_end->Branch_stmtrep()); STMTREP *hwbr = CXX_NEW (STMTREP(OPC_ZDLBR), _cfg->Mem_pool()); hwbr->Set_label_number(do_body->Labnam()); do_end->Append_stmtrep(hwbr);
and:
zdl.Identify_zdl_loops();
if (zdl.Dump()) {
zdl.Print_zdl_loops("Before ZDL Change");
}
zdl.Change_zdl_loops();
Do_dead_code_elim(FALSE,FALSE,FALSE,FALSE,FALSE,FALSE,NULL);
if (zdl.Dump()) {
zdl.Print_zdl_loops("After ZDL Change");
}
As for the CG part, introducing a new pseudo op "auxbr" is suggested:
void Target_Specific_ZDLBR_Expansion(TN* target_tn) {
Build_OP (TOP_auxbr, target_tn, &New_OPs);
}
With right linking auxbr operator to the loop head label, we identify the zdl loop in cg expansion phase and delay the zdl implementation to the cg loop optimization phase after unroll performed. So, we believe the new ZDL implementation does not break the good interaction between WOPT and CG.
In cg loop's final handling, we add:
void Perform_Loop_Optimizations(){
...
#ifdef ZDL_TARGET
{
Free_Dominators_Memory ();
Calculate_Dominators();
/* needed for loop recognition */
LOOP_DESCR_Detect_Loops(&loop_descr_pool);
LOOP_DESCR_Create_Loop_Tree(&loop_descr_pool);
if(trace_general){
LOOP_DESCR_Dump_Loop_Tree();
}
/* here is the main interface for target-depend ZDL Gen */
extern INT CG_LOOP_ZDL_Gen(LOOP_DESCR*);
for( int i = 0; i < VECTOR_count(loop_tree_roots); i++ ) {
CG_LOOP_ZDL_Gen( (LOOP_DESCR*)VECTOR_element(loop_tree_roots,i) );
}
}
#endif
MEM_POOL_Pop (&loop_descr_pool);
MEM_POOL_Delete(&loop_descr_pool);
Free_Dominators_Memory ();
}
The interface CG_LOOP_ZDL_Gen does the really ZDL GEN work. This is where the target specific hardware operators introduced, and highly specialized check and generator comes in. Another way to generate ZDL is by lowering MSTORE, so we have an update for WN lower.cxx. In function copy aggregate loop const, we have the following ZDL handling.
@@ -8925,8 +8926,31 @@
srcPreg, dstPreg, origLoad, origStore,
copy_alignment, actions );
if ( OPT_Lower_ZDL ) {
WN *infoblock = WN_CreateBlock();
WN *do_loop_info = WN_CreateLoopInfo(WN_CreateIdname(n, intPreg),
WN_CreateIntconst(OPC_I4INTCONST, nMoves),
0,
loop_nest_depth+1,
WN_LOOP_INNERMOST);
do_loop_info = lower_expr(infoblock, do_loop_info, actions);
WN_Set_Loop_Nz_Trip(do_loop_info);
WN_DELETE_Tree(infoblock);
doLoop = WN_CreateDO ( WN_CreateIdname(n, intPreg),
start,
NULL,
incr,
body,
do_loop_info);
traceZDL = Get_Trace(TP_LOWER, 0x400);
if (traceZDL)
fprintf(TFile,
"lower zdl, PU:%s:%s, trip_count:%lld\n",
Src_File_Name,
ST_name(Get_Current_PU_ST()),
nMoves);
} else {
doLoop = WN_CreateDO ( WN_CreateIdname(n, intPreg),
start,
test,
incr,
body,
NULL );
}
We identify those ZDL in copy aggregate loop const with the DO_LOOP NULL condition, then we have special lowering to ZDL in lower do loop.
Essential Code Changes
In this section, mainly code changes to support ZDL is outlined. We have tried to minimize the code changes to support ZDL. We don't have essential algorithm update for WOPT, only inserted the OPR ZDLBR handling cases accordingly. We do a port from ML WHIRL EMITTER::Build loop info to our ZDL::Has Trip Count, this is for the safe transformation of those CG recognizable trip-counted loops.
WOPT Changes
in opt_dce.cxx
@@ -2153,7 +2157,9 @@ return TRUE; #ifdef KEY // bugs 5401 and 5267 - if (OPERATOR_is_scalar_store(opr) && + if ( + _opt_phase != MAINOPT_PHASE && + OPERATOR_is_scalar_store(opr) && + Opt_stab()->Aux_stab_entry(stmt->Lhs()->Aux_id())->Mp_no_dse()) return TRUE; #endif
We add the statement"opt phase != MAINOPT PHASE", since Mp_no_dse which should not be exist in MainOpt is shared with AUX_NO_SPRE.
In opt_dce.cxx
@@ -2814,8 +2821,10 @@
// back out during PRE
// NOTE: this also picks up the Trip_count_stmt().
BB_NODE *dohead = loop->Start();
- if ( (dohead && loop->Is_flag_set(LOOP_DO)
- && dohead->Kind() == BB_DOHEAD) ||
- (dohead && loop->Is_flag_set(LOOP_PRE_DO)
&& dohead->Kind() == BB_DOSTART) ) {
+ if ( (dohead && dohead->Kind() == BB_DOHEAD &&
+ (loop->Is_flag_set(LOOP_DO)
+ || loop->Is_flag_set(LOOP_WHILE)
+ || (loop->Is_flag_set(LOOP_REPEAT))))
+ ||
+ (dohead && loop->Is_flag_set(LOOP_PRE_DO)
+ && dohead->Kind() == BB_DOSTART)) {
obviously, we missed WHILE and REPEAT loops in keeping the EVAL for trip-count.
CG changes
we should keep the bb contains loop info label the distinct bb even it is empty. This is essential since we have to make each loop (head label) distinct , and it is used for reverse back from zdl loop to normal loops. So,
there are changes in whirl2ops.cxx and cflow.cxx.
in expand label(whirl2ops.cxx):
#ifdef KEY BOOL is_non_local_label; - bb = Add_Label(Get_WN_Label (stmt, &is_non_local_label)); + + if (info) + bb = Add_Label_With_Forced_New_BB( + Get_WN_Label (stmt, &is_non_local_label)); + else if (BB_length(Cur_BB) == 0 + && ANNOT_Get (BB_annotations(Cur_BB), ANNOT_LOOPINFO)) + bb = Add_Label_With_Forced_New_BB( + Get_WN_Label (stmt, &is_non_local_label)); + else + bb = Add_Label(Get_WN_Label (stmt, &is_non_local_label)); +
and in collapseempty bb, we prevent the merging of empty bbs containing loop info:
@@ -2294,6 +2298,8 @@
&& !BB_Has_Exc_Label(new_targ)
&& BBINFO_nsuccs(new_targ)
&& BBINFO_succ_bb(new_targ, 0) != new_targ
+ && !ANNOT_Get(BB_annotations(new_targ), ANNOT_LOOPINFO)
+ && !ANNOT_Get(BB_annotations(BBINFO_succ_bb(new_targ,0)),
+ ANNOT_LOOPINFO)
) {
BB *last_targ;
if (freqs_computed || BB_freq_fb_based(new_targ)) {
We believe control will not fall into these changes if they don’t do zdl, and even fall through it, this should be ok, leaving more empty bbs.
Porting
We find the new zdl implementation will at least help Itanium which also has a similar ZDL arch(br.cloop). If you want to support ZDL in your target, maybe you find the following helps you:
WOPT_ZDL and LOWER_ZDL
Firstly, you should enable WOPT ZDL and LOWER ZDL in your target init script, i.e, common/com/YOUR TARGET/config targ.cxx.
+ extern BOOL WOPT_Enable_ZDL;
+ extern BOOL WOPT_Enable_ZDL_Set;
+ extern BOOL OPT_Lower_ZDL;
+ extern BOOL OPT_Lower_ZDL_Set;
+ if ( (Target == TARGET_sl1_pcore || Target == TARGET_sl1_dsp) &&
+ Opt_Level > 1)
+ {
+ if (!WOPT_Enable_ZDL_Set)
+ WOPT_Enable_ZDL = TRUE;
+ if (!OPT_Lower_ZDL_Set)
+ OPT_Lower_ZDL = TRUE;
+ }
+
Now, you are able to see ZDLBR in your wopt output.
TOP_axubr
After that, you add a pseudo operator like TOP auxbr in your machine description(common/targ info/isa/YOUR TARGET/isa *.cxx). This pseudo operator better has condition properties and control transfer properties, and
of course, it is a pseudo operator.
+ case OPC_ZDLBR: + target_tn = Gen_Label_TN (Get_WN_Label (branch), 0); + Target_Specific_ZDLBR_Expansion(target_tn); + break;
You should reimplement the interface Target Specific ZDLBR Expansion(TN* target tn). The suggested way is as below:
void Target_Specific_ZDLBR_Expansion(TN* target_tn) {
Build_OP (TOP_auxbr, target_tn, &New_OPs);
}
ZDL_TARGET
You need enable macro ZDL TARGET in your osprey/be/cg/Makefile.gbase:
# SL is a ZDL target TARGDEFS += -DZDL_TARG
It opens the zdl handling interface in cg loop.cxx i.e,
#ifdef ZDL_TARG
{
Free_Dominators_Memory ();
...
extern INT CG_LOOP_ZDL_Gen(LOOP_DESCR*);
for( int i = 0; i < VECTOR_count(loop_tree_roots); i++ ) {
CG_LOOP_ZDL_Gen( (LOOP_DESCR*)VECTOR_element(loop_tree_roots,i) );
}
}
#endif
CG_LOOP_ZDL_Gen
Finally, implement the CG_LOOP_ZDL_Gen(LOOP_DESCR*) interface in a CG target directory file, do
- replace the TOP auxbr with real ZDL branch operators
- setup the ZDL environment
- roll back from the ZDL to normal loops
- ......
You get ZDL.
Body Validity Check
When you want to make sure the generated ZDLs are right, you can add ZDL body check in the CGEMIT phase. Just implement the interface of Emit_Phase_Valid_Check defined in cgemit.cxx.
Special Handle
You may also want input to cflow.cxx, for some special handling between the loop prolog and loop head. If you find the trip count tn is clobbered by findloops.cxx, you can do something like the following to recover it.
if (!trip_count_tn) { //Early exit
LOOPINFO *loop_info=ANNOT_loopinfo(ANNOT_Get(BB_annotations(head), ANNOT_LOOPINFO));
WN *trip_wn = WN_loop_trip(LOOPINFO_wn(loop_info));
Is_True(trip_wn, ("trip_wn is NULL"));
trip_count_tn = Expand_Expr(trip_wn, NULL, NULL);
Restore_ZDL(prolog,tail,trip_count_tn);
return zdl_level;
}
We also provide support for ZDL targets with early exit, innermost only. Just enable the flagsWOPT Enable ZDL Early Exit andWOPT ZDL Innermost Only to support. But you should be aware that the body has big aggregate structures to move, i.e, big constant MSTORE, more one level ZDL will be generated inside the innermost loop detected by wopt. For example:
typedef struct T { int a[170]; int b; } _t;
_t s[18];
_t t[18];
void f(void){
int i;
for (i=0; i<18; i++) {
t[i]=s[i];
}
}
You still need to handle this case in CG.
NO_ZDL_PRAGMA
If you don’t want some ZDL loops to be ZDLed, we provide the mechanism to disable ZDL gen through pragma no zdl.
For example:
extern int size(void);
extern int s;
extern int s2;
void f(void){
int i,j;
for(i=size();i<0;i++){
#pragma zdl off
for(j=0;j<s2;j++){
#pragma zdl off
s+=i+j;
}
}
}
Test and Issues
The ZDL patch has been well tested including regression cases, application build and run on SL target. We have also tested the option of innermost only and early exit. For the feedback directed optimization, we have only tested typical integer applications with fb phase=0. All the tests show the approach is OK.
We have not tested ZDL generation on the case of unrolling loops with no trip-count. Since ZDL targets are often DSP chips with very limited budget of code-size. So, we have not test this corner case. If there are indeed such requirement, we will make a more deeper investigation.