Porting of programs from VAX to AXP. ==================================== I) To get optimal performance from the AXP architecture. II) Features/problems in the new compilers on AXP. III) Porting problems VAX -> AXP. I) To get optimal performance from the AXP architecture. ========================================================= A) proper alignment of data B) choosing algorithms to maximize benefit from cache C) choosing programming style to allow maximal pipelining II) Features/problems in the new compilers on AXP. =================================================== A) C B) FORTRAN C) PASCAL III) Porting problems VAX -> AXP. ================================== A) problems with system services B) problems with macro32 C) architectural problems D) problems with undocumented features E) MACRO64 assembler F) VEST Illustration of the importance of proper data alignment and cache. ================================================================== Performance chart: VAX 4200 DEC 3400 (rated at 5 VUPS) (rated at 110 Specmark) aligned not aligned aligned not aligned 4 byte 4 byte 8 byte 8 byte small data 1095 365 26665 2300 (64KB) medium data 1040 350 10800 1200 (3.2MB) large data 440 240 7700 1160 (16MB) Numbers=number of REAL*8 initialized pr. millisecond CPU time (not real time). Conclusions: - REAL*8 must be longword aligned on VAX but quadword aligned on AXP - performance increase with correct alignment is a factor 2-3 on VAX but a factor 5-10 on AXP - the performance increase in keeping the data in cache on AXP is a factor 2 TEST_ALIGN.FOR ============== PROGRAM TEST_ALIGN INTEGER*4 I,J,K INTEGER*4 SIZ(3) BYTE DUMMY(9) REAL*8 X(2000001) EQUIVALENCE (DUMMY,X) DATA SIZ/8000,400000,2000000/ DO 400 K=1,3 DO 200 J=1,100 DO 100 I=1,SIZ(K)+1 X(I)=I 100 CONTINUE CALL FOOL(X) 200 CONTINUE DO 300 I=1,9 CALL TEST(DUMMY(I),SIZ(K)) 300 CONTINUE 400 CONTINUE END C SUBROUTINE TEST(X,SIZ) INTEGER*4 SIZ REAL*8 X(*) INTEGER*4 I,J,T INTEGER*4 PAS$CLOCK2 T=PAS$CLOCK2() DO 200 J=1,100 DO 100 I=1,SIZ X(I)=I 100 CONTINUE CALL FOOL(X) 200 CONTINUE WRITE(6,'(1X,F10.3)') (100.0*SIZ)/(PAS$CLOCK2()-T) RETURN END C SUBROUTINE FOOL RETURN END Compilers and alignment. ======================== FORTRAN ------- (FORTRAN 6.1 for AXP) The alignment within COMMON blocks and STRUCTUREs are determined by the ALIGNMENT quailfier (default is /ALIGNMENT=(COMMONS=PACKED,RECORDS=NATURAL)). The WARNING qualifier controls whether warnings are issued for non natural aligned variables in COMMON blocks (default is /WARNING=ALIGNMENT). Other variables are always natural aligned. PASCAL ------ (PASCAL 5.0 for AXP) The alignment within RECORDs are determined by the ALIGNMENT qualifier (default is /ALIGNMENT=ALPHA_AXP, which use natural alignment - on VAX the default is /ALIGNMENT=VAX, which use packed alignment). Other variables are always natural aligned unless another alignment is explicit defined in the code as an attribute for the variable. C - (DEC C 1.3 for AXP) The alignment within STRUCTs are determined by the MEMBER_ALIGNMENT qualifier (default is /MEMMBER_ALIGNMENT, which use natural alignment - on VAX the default is /NOMEMBER_ALIGN, which use packed alignment). The qualifier can be override temporarrily with the #pragma member_alignment and #pragma nomember_alignment directives. Other variables are always natural aligned. Points ------ 1) Natural alignment is necesarry for optimal performance (see previous example). 2) The AXP compilers "like" natural alignment, while the VAX compilers "like" packed alignment, which makes identical records have non-identical sizes on the two architectures. 3) Be carefull when calling f.ex. MACRO32 routines with natural aligned records. ALIGN_SIZE.C ============ #include struct example {char a; double b;}; main() { struct example test; printf("%d\n",sizeof(test)); } Maximizing benefit from pipelining. =================================== The AXP are a highly pipelined architecture meaning that it benefits greatly from optimization techniques such as loop unrolling. Example: VAX 4200 DEC 3400 (rated at 5 VUPS) (rated at 110 Specmark) /opt=unrol:1 /opt=unroll:5 FORTRAN standard loop (1) 219820 8360 5350 FORTRAN unrolled loop (5) 163430 5340 4710 assembler unrolled loop 157490 4450 4490 numbers=millisecond CPU time used for 100000 calls to dot poduct of 1000 elements array Conclusions: - the benefit of a 1->5 loop unrolling is a saving of 25% on VAX and a saving of 35% on AXP - the AXP compiler is just as good to unroll as the programmer, so manual loop unrolling on AXP are a waste of time - both on VAX and AXP are handwritten assembler code still faster (not much 5% on VAX and 15% on AXP in this example) Compiler defaults for loop unrolling: VAX AXP FORTRAN no 0 = depends on the code, often 4 C no 0 = depends on the code, often 4 PASCAL no no LOOP_UNROLL.FOR =============== PROGRAM LOOP_UNROLL INTEGER*4 N PARAMETER (N=1003) INTEGER*4 I,T REAL*8 X(N),SUM INTEGER*4 PAS$CLOCK2 REAL*8 DOT1,DOT5,DOTASM DO 100 I=1,N X(I)=I 100 CONTINUE T=PAS$CLOCK2() DO 200 I=1,100000 SUM=DOT1(N,X) 200 CONTINUE WRITE(*,*) '1 :',SUM,PAS$CLOCK2()-T T=PAS$CLOCK2() DO 300 I=1,100000 SUM=DOT5(N,X) 300 CONTINUE WRITE(*,*) '5 :',SUM,PAS$CLOCK2()-T T=PAS$CLOCK2() DO 400 I=1,100000 SUM=DOTASM(N,X) 400 CONTINUE WRITE(*,*) 'ASM :',SUM,PAS$CLOCK2()-T END C REAL*8 FUNCTION DOT1(N,X) INTEGER*4 N REAL*8 X(*) INTEGER*4 I DOT1=0 DO 100 I=1,N DOT1=DOT1+X(I)*X(I) 100 CONTINUE RETURN END C REAL*8 FUNCTION DOT5(N,X) INTEGER*4 N REAL*8 X(*) INTEGER*4 I DOT5=0 DO 100 I=1,MOD(N,5) DOT5=DOT5+X(I)*X(I) 100 CONTINUE DO 200 I=MOD(N,5)+1,N,5 DOT5=DOT5+X(I)*X(I)+ + X(I+1)*X(I+1)+ + X(I+2)*X(I+2)+ + X(I+3)*X(I+3)+ + X(I+4)*X(I+4) 200 CONTINUE RETURN END DOTASM.MAR ========== .title dotasm .psect $CODE quad,pic,con,lcl,shr,exe,nowrt .entry dotasm,^m movl @4(ap),r2 ; r2=N movl 8(ap),r3 ; r3=X clrd r0 ; r0=DOTASM divl3 #5,r2,r4 mull3 #5,r4,r4 subl3 r4,r2,r4 ; r4=MOD(N,5) mull2 #8,r4 addl2 r3,r4 ; r4=X[MOD(N,5)+1] mull3 #8,r2,r5 addl2 r3,r5 ; r5=X[N+1] cmpl r3,r4 bgeq 200$ 100$: muld3 (r3),(r3),r6 addd2 r6,r0 addl2 #8,r3 cmpl r3,r4 blss 100$ 200$: cmpl r3,r5 bgeq 400$ 300$: muld3 (r3),(r3),r6 addd2 r6,r0 muld3 8(r3),8(r3),r6 addd2 r6,r0 muld3 16(r3),16(r3),r6 addd2 r6,r0 muld3 24(r3),24(r3),r6 addd2 r6,r0 muld3 32(r3),32(r3),r6 addd2 r6,r0 addl2 #40,r3 cmpl r3,r5 blss 300$ 400$: ret .end DOTASM.M64 ========== $routine DOTASM,kind=stack,saved_regs= $linkage_section c8: .long 8 $code_section ldl r22,(r16) ; r22=N mov r17,r23 ; r23=X .base r27,$ls $call OTS$REM_I,args=<(r16)/l,c8/l> mull r0,8,r2 addl r2,r23,r3 ; r3=X[MOD(N,5)+1] mull r22,8,r22 addl r22,r23,r24 ; r24=X[N+1] fmov f31,f0 ; f0=DOTASM subl r3,r23,r1 ble r1,200$ 100$: ldg f10,(r23) mulg f10,f10,f22 addg f22,f0,f0 addl r23,8,r23 subl r3,r23,r1 bgt r1,100$ 200$: subl r24,r23,r1 ble r1,400$ 300$: ldg f10,(r23) ldg f11,8(r23) ldg f12,16(r23) ldg f13,24(r23) ldg f14,32(r23) addl r23,40,r23 mulg f10,f10,f22 mulg f11,f11,f23 mulg f12,f12,f24 mulg f13,f13,f25 mulg f14,f14,f26 subl r24,r23,r1 addg f22,f23,f27 addg f24,f25,f28 addg f26,f0,f0 addg f27,f28,f29 addg f29,f0,f0 bgt r1,300$ 400$: $return $end_routine DOTASM .end Porting from VAX C to DEC C. ============================ DEC C is fully ANSI compatible, which VAX C is not ! There are a lot of differences ! 1) Much stricter syntax-check even with the default /STANDARD=RELAXED_ANSI89 (and guess what happend with /STANDARD=ANSI89). You will have to make a lot of small modifications (f.ex. type casts). And if we want to see all the warnings about sloppy code, then the qualifier /WARNING=ENABLE:ALL will surely find them. Most of theese problems can be worked around with the /STANDARD=VAXC qualifier. 2) The C RTL are now automaticly searched by LINK, so no VAXC.OPT file anymore. (the routines are prefixed DECC$, if you need them in other languages) 3) The preprocessor are now a real ANSI C preprocessor. VAX C: #module ident string DEC C: #pragma module ident string VAX C: #define dmpi(i) printf("i" "=%d\n",i) DEC C: #define dmpi(i) printf(#i "=%d\n",i) VAX C: #define init(i) x_/**/i=i DEC C: #define init(i) x_##i=i 4) The builtins has changes significantly (obviously, since many of the builtins gives directly access to the instruction set). Among the many small goodies are __ALLOCA, which have been missed in VAX C for years. 5) In VAX C sizeof('a') returns sizeof(char), while DEC C sizeof('a') returns sizeof(int), also when /STANDARD=VAXC is specified. 6) Variables explicit declared as "extern" are no longer default implemented as psects but as global symbols. And as a consequence they do no longer match f.ex. COMMON BLOCK's in FORTRAN. To restore the old behaviour use: /EXTERN_MODEL=COMMON_BLOCK or #pragma extern_model common_block. The default behaviour is now /EXTERN_MODEL=RELAXED_REFDEF or #pragma extern_model relaxed_refdef. There are an entire manual "DEC C Migration Guide for OpenVMS VAX Systems" that discuss the VAX C to DEC C migration (on VMS VAX, but many of the facts can be used on AXP too). Conclusion: a proper port can take some time, but the /STANDARD=VAXC kludge can be used for a fast temporary port. Porting from FORTRAN 5.x to 6.x. ================================ There are some small differences. 1) Unless the /NOSEPERATE qualifier is used, then all subroutines/functions in one source-file is compiled to one object-module, which enables global optimization. [AXP only] 2) A new qualifier /RECURSIVE allows recursive programming in FORTRAN (it uses stack-allocation for local variables). I strongly recommend to always use /WARNING=UNINITIALIZED with /RECURSIVE, since uninitialized stack-variables are not zero ! 3) Capability of reading/writing non-native binary-formats (big endian, IBM floatings,IEEE floatings on VAX). 4) A new intrinsic function IARGCOUNT returns the number of arguments a subroutine/function is actually called with. This superseedes all kinds of small MACRO routine son VAX that do not work on AXP. [V6.1] 5) A new qualifier /WARNING=UNUSED helps you to clean up your code. 6) If you install FORTRAN 6.x on VMS VAX 5.x/6.0, then you get new FORRTL and MTHRTL, so EXE-files linked here can not be run on other VAX'es running the same VMS version, but without FORTRAN 6.x installed. And the images can not be VESTed on AXP either. The release notes describe how to workaround this problem (link with the old "standard" libraries) in section 1.1.1.3 ! In short: - use the /MATH=V5 qualifier - use the defines: $ DEFINE MTHRTL FORTRAN$MTHRTL-VMS $ DEFINE VMTHRTL FORTRAN$VMTHRTL-VMS before linking 7) FORTRAN on VAX supports the following piece of pseudo FORTRAN 66: CALL SUBBO('This is a string') SUBROUTINE SUBBO(S) INTEGER*4 S(*) (the compiler issues some information, that the linker uses to fixup the call) FORTRAN on AXP does not support this. 8) There is a bug in some versions of the compiler regarding generation of DST records for subroutines with extra ENTRY points. Steve Lionel has written a nice little guide, which is available via anonymous FTP ! (URL: ftp://ftp.digital.com/pub/Digital/info/whitepaper/migrating-fortran.ps) Conclusion: Easy porting, since most changes are new features. Porting from PASCAL 4.x to 5.x. =============================== I have not encountered any problems (but I have not ported very much code either, so do not jump to conclusions). The PASCAL compilers optimization on AXP is much better than on VAX (compared to the other compilers). Differences in the LINKer. ========================== There are a few differences with the linker too: 1) The C RTL are linked in by default (one options file out). 2) The transfer vectors are created different. VAX MAR-file: .MACRO TVGEN ZZZZ .TRANSFER ZZZZ .MASK ZZZZ JMP ZZZZ+2 .ENDM TVGEN TVGEN name1 TVGEN name2 TVGEN name3 AXP OPT-file: SYMBOL_VECTOR=(name1=PROCEDURE,name2=PROCEDURE,name3=PROCEDURE) Note that psects are tread the same way as procedures: namepsect=PSECT ! 3) The very usefull but undocumented/unsupported option UNSUPPORTED=1 are missing. There are other differences too. Study the LINKER manual carefull, if you have complex LINK setups. Problems with system services. ============================== In general all system services are present and behaves identical on VAX and AXP. Most of the differences are due to the differences in versions of VMS on VAX and AXP, which will eventually disappear when VSM VAX and VMS AXP merges. There are a few system-services that refer to pagecount and start/end addresses. The pagecounts are not a problem, because they refer to pagelets of 512 byte, not physical pages of 8192 bytes. But the start/end addresses can be a problem as input arguments. The addresses are rounded down and up to nearest page boundry and on AXP that is a physical page boundry of 8192 bytes ! INADR.C ======= #include #include #define N 100 int mem[N]; int sys$setprt(); main() { int stat; int inadr[2] = {&mem[0],((int)&mem[N])-1}; int retadr[2]; stat = sys$setprt(inadr,retadr,0,PRT$C_UR,0); printf("stat = %8x\ninadr = %8x %8x\nretadr = %8x %8x\n", stat, inadr[0],inadr[1], retadr[0],retadr[1]); } Problems with macro32. ====================== The purpose of the MACRO32 is easy porting of VMS and special priviliged user applications. As a consequence it is very effecient and low-level (and supports writing device drivers etc.). It is also not as good as it could have been for porting ordinary user applications. There are many good porting hints in "Migrating to an OpenVMS AXP System: Porting VAX MACRO Code". Declaration of JSB-routines. ---------------------------- JSB routines must be declared in the MACRO32 compiler. VAX MACRO32: JSB name . . . name: . . . RSB AXP MACRO32: JSB name . . . name: .JSB_ENTRY INPUT=,OUTPUT= . . . RSB If not input and output are declared, then the compiler assumes that there are no register dependencies and saves/restores everything, so the routine do not work properly. And no warnings ! Poor floating point support. ---------------------------- The MACRO32 compiler do not support floating point register operations. It only supports floating point operations on memory locations. This gives poor performance (it gets even worse because the floating point instructions are being translated into JSB calls). And be aware: the compiler do not give any warnings about the register operations - the results are simply wrong. Ignore directive. ----------------- The .LINK directive are ignored. The compiler do not give a warning, but the directive has no effect. Warnings. --------- Quadword operations, which the compiler can not be sure are natural aligned results in warnings. Entry masks. ------------ Remove the iv bit specification from entry-masks. Compilation results in errors. Bugs. ----- The MACRO32 compiler on VMS AXP 6.1 has a bug, which sometimes makes the EDIV instruction generate wrong results. Workaround: use LIB$EDIV ! Architectural features/problems. ================================ Integers. --------- AXP is a real 64 bit architecture, so it has 64 bit integers. FORTRAN: INTEGER*8 gives a 64 bit integer, and the qualifier /INTEGER_SIZE=64 makes it default for INTEGER C: you need to use the type __int64 (or if you include ints.h, then you can use int64) - this is not exactly emphasized in online help PASCAL: INTEGER64 is the dattype for 64 bit integers Conclusions: - new integer type with greater range Floating point. --------------- navn type/size/precisison VAX AXP F VAX 4 byte single full support full support D VAX 8 byte double full support only load/store all calculation as G G VAX 8 byte double full support full support H VAX 16 byte quadrouple hardware old VAX no support emulation new VAX S IEEE 4 byte single FORTRAN 6.x can full support read/write T IEEE 8 byte double FORTRAN 6.x can full support read/write X IEEE 16 byte quadrouple ? emulation in FORTRAN 6.2 ? Compiler switches: FORTRAN/PASCAL/C F default on AXP S /FLOAT=IEEE D /FLOAT=D G default on AXP T /FLOAT=IEEE Conclusions: - new standard datatype (IEEE, used by Intel and various RISC/UNIX platforms) - H-floating is missing - G-floating has beaten D-floating Procedure calling standard. --------------------------- AXP is not using the VAX procedure calling standard. In an attempt to speed up procedure calls the first 6 arguments are no longer passed in a seperate argumentlist, but in 6 hardware registers R16-R21 for integer and F16-F21 for floating point. The VAX proceduure stack frame are not used anymore either. Consequences: 1) It is not easy to write a routine NARG, which in the call-sequence X -> Y -> NARG can tell Y how many arguments X called Y with. FORTRAN now has the IARGCOUNT function to get this information. C has the standard stdarg/vararg. PASCAL also has facilities to do this. MACRO32 code can still refer directly to (AP). 2) It is no longer possible to write a routine that inserts an error-handler for the routine calling this routine. The LIB$ESTABLISH/LIB$REVERT no longer exist as routines. The FORTRAN and C compilers on AXP recognizes them and generates inline code. PASCAL and MACRO32 programs can not use them. MACRO32 code can still refer directly to (FP). Both C and PASCAL has its own error-handling routines, that still works. 3) Fake returns do not work any longer too. (A calls B calls C returns to A) 4) Tracing via the call stack is not possible the same way on VMS AXP as VMS VAX. All MACRO code refferring to something(FP) should be carefully examined. There are some good descriptions in "OpenVMS Calling Standard". For those who want to start playing with the VMS AXP stacks, then the routines LIB$GET_CURRENT_INVO_CONTEXT and LIB$GET_PREVIOUS_INVO_CONTEXT are the way to go (they are documented in the manual mentioned above). Problems with undocumented features. ==================================== Program linked directly with VMS. --------------------------------- VAX: $ LINK anything+SYS$SYSTEM:SYS.STB/SELECTIVE_SEARCH (or .link "sys$system:sys.stb"/selective_search in MACRO32 code) AXP: $ LINK/SYSEXE anything (or SYS$LOADABLE_IMAGES:SYS$BASE_IMAGE.EXE/SHARE in options-file) (DCLDEF.STB are still there) Size of SYS0 fields. -------------------- Certain fields in SYS0 space has changed size. Example: SYS$GW_IJOBCNT -> SYS$GL_IJOBCNT. SYS0.C on VAX ------------- #include globalref sys$gw_ijobcnt; main() { printf("Number of interactive jobs=%d\n",sys$gw_ijobcnt); } SYS0.C on AXP ------------- #include globalref sys$gl_ijobcnt; main() { printf("Number of interactive jobs=%d\n",sys$gl_ijobcnt); } Read/write in other process address-space. ------------------------------------------ The well-known hack for reading/writing in another process address-space via a double/single special kernel mode AST are extremely difficult in VMS AXP. It is said to be possible, but DEC has supplied to new routines (undocumented/unsupported) EXE$READ_PROCESS and EXE$WRITE_PROCESS to do it. Format of EXE-files. -------------------- The format of EXE-files has changed. Any program reading and interpreting image header and GST/DST must be rewritten. See the difference between the $IHADEF/$IHDDEF and $EIHADEF/$EIHDDEF macros for details. SYS$IMGACT. ----------- I have not yet been able to get the undocumented/unsupported system service SYS$IMGACT to work properly on VMS AXP. MACRO64 assembler. ================== It is actually rather simple to use. It i mostly like MACRO32, but one has to get used to the load/store way of addressing. I doubt that it will ever be very musch used. It is not bundled with VMS. MACRO32 is still used for priviliged code. DEC has announced that writing device-drivers in C will be supported soon. VEST. ===== The purpose of VEST is easy porting of user applications with no direct dependence of VMS. It is based on both translation and interpretation. I have moved two applicatiosn with VEST: - a 75 block executable with a 500 block shareable - a 500 block executable Both programs worked fine after VESTing. Efficiency chart: VAX 4200 DEC 3400 (rated at 5 VUPS) (rated at 110 Specmark) VESTed native integer 13240 2260 430 calculations floating point 36510 2470 420 calculations character 25420 8500 2250 moves Numbers=milliseconds of CPU usage on specified operation on small data. Conclusions: - VEST gived an overhead of a factor 3-5 - the gain of the VAX -> AXP change depends on the type of operations (integer: factor 25, floating point: factor 75,character: 10) [DEC's tests on the Specmark programs says a factor 2-6 in VEST overhead and the hardware upgrade scaled to same number of VUPS gives a factor 12-56] VEST.FOR ======== PROGRAM VEST CALL TESTINT CALL TESTFLOAT CALL TESTCHAR END C SUBROUTINE TESTINT INTEGER*4 I,J,T,A(4000),B(4000) INTEGER*4 PAS$CLOCK2 DO 100 I=1,4000 A(I)=I 100 CONTINUE T=PAS$CLOCK2() DO 300 J=1,1000 DO 200 I=4,3997 B(I)=(A(I-3)+A(I-2)+2*A(I-1)+8*A(I)+2*A(I+1)+A(I+2)+A(I+3))/16 200 CONTINUE CALL FOOL(B) 300 CONTINUE WRITE(*,*) PAS$CLOCK2()-T RETURN END C SUBROUTINE TESTFLOAT INTEGER*4 I,J,T REAL*8 A(4000),B(4000) INTEGER*4 PAS$CLOCK2 DO 100 I=1,4000 A(I)=I 100 CONTINUE T=PAS$CLOCK2() DO 300 J=1,1000 DO 200 I=4,3997 B(I)=0.0625D0*A(I-3)+ + 0.0625D0*A(I-2)+ + 0.125D0*A(I-1)+ + 0.5D0*A(I)+ + 0.125D0*A(I+1)+ + 0.0625D0*A(I+2)+ + 0.0625D0*A(I+3) 200 CONTINUE CALL FOOL(B) 300 CONTINUE WRITE(*,*) PAS$CLOCK2()-T RETURN END C SUBROUTINE TESTCHAR INTEGER*4 I,J,T CHARACTER*4000 A,B INTEGER*4 PAS$CLOCK2 DO 100 I=1,4000 A(I:I)=CHAR(MOD(I,256)) 100 CONTINUE T=PAS$CLOCK2() DO 300 J=1,1000 DO 200 I=1,3901 B(I:I+49)=A(I+50:I+99) 200 CONTINUE CALL FOOL(B) 300 CONTINUE WRITE(*,*) PAS$CLOCK2()-T RETURN END C SUBROUTINE FOOL RETURN END