24KEC SIMD

This topic contains 8 replies, has 3 voices, and was last updated by  Avantika 2 years, 3 months ago.

Viewing 9 posts - 1 through 9 (of 9 total)
  • #48805

    shabeer
    Member

    Hi,
    I am using a 24KEc core … I changed my 16x16 code to SIMD as shown below, but I am getting much worse performance
    than without any SIMD. Any suggestions?

    accum[0]=dpaq_s_w_ph(accum[0],curr_in1,(psinc_simd1[0]));
    accum[0]=dpaq_s_w_ph(accum[0],curr_in2,(psinc_simd2[4]));
    accum[1]=dpaq_s_w_ph(accum[1],curr_in1,(psinc_simd1[1]));
    accum[1]=dpaq_s_w_ph(accum[1],curr_in2,(psinc_simd2[5]));
    accum[2]=dpaq_s_w_ph(accum[2],curr_in1,(psinc_simd1[2]));
    accum[2]=dpaq_s_w_ph(accum[2],curr_in2,(psinc_simd2[6]));
    accum[3]=dpaq_s_w_ph(accum[3],curr_in1,(psinc_simd1[3]));
    accum[3]=dpaq_s_w_ph(accum[3],curr_in2,(psinc_simd2[7]));

    #48813

    shabeer
    Member

    The compiler seems to generate the wrong assembly code. That is, dpaq_s_w_ph uses the same accumulator register, accumulator 0, for the MAC operations on all array elements, so the code stores and reloads the accumulator between the dpaq_s_w_ph operations. How can I make sure dpaq_s_w_ph uses a different accumulator for each variable instead of always using accumulator 0? Could anyone help?

    #49001

    yanan
    Member

    Hi Shabeer,

    Could you show how your dpaq_s_w_ph is implemented?

    There is a built-in function, __builtin_mips_dpaq_s_w_ph, for the DPAQ_S.W.PH instruction.
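
    For anyone checking a SIMD rewrite against the original 16x16 code, here is a minimal portable sketch of what one DPAQ_S.W.PH step computes, as I understand the MIPS DSP ASE description. The function names are hypothetical, the operands are passed as separate halfwords instead of a packed v2q15, and the model does not track the overflow flag in DSPControl.

```c
#include <stdint.h>

/* Hypothetical helper: Q15 x Q15 -> Q31 fractional product.
   The only case that saturates is -1.0 * -1.0 (both operands 0x8000). */
static int32_t mul_q15_q15(int16_t a, int16_t b)
{
    if (a == INT16_MIN && b == INT16_MIN)
        return INT32_MAX;               /* saturate to 0x7FFFFFFF */
    return (int32_t)a * (int32_t)b * 2; /* integer product doubled gives Q31 */
}

/* Hypothetical reference model of one dpaq_s.w.ph step:
   acc += (a_hi * b_hi) + (a_lo * b_lo), each product in Q31,
   sign-extended and added to the 64-bit accumulator.
   Assumption: the accumulator does not overflow 64 bits here. */
static int64_t dpaq_s_w_ph_model(int64_t acc,
                                 int16_t a_hi, int16_t a_lo,
                                 int16_t b_hi, int16_t b_lo)
{
    int64_t dotp = (int64_t)mul_q15_q15(a_hi, b_hi)
                 + (int64_t)mul_q15_q15(a_lo, b_lo);
    return acc + dotp;
}
```

    With a = (0.5, 0.25) and b = (0.5, 0.5) in Q15, i.e. dpaq_s_w_ph_model(0, 0x4000, 0x2000, 0x4000, 0x4000), one step adds 0x30000000 (0.375 in Q31) to the accumulator.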

    #49004

    shabeer
    Member

    Thanks for the reply.

    dpaq_s_w_ph is a macro defined as __builtin_mips_dpaq_s_w_ph, like this:

    #define dpaq_s_w_ph __builtin_mips_dpaq_s_w_ph

    I have seen in the objdump output that it generates DPAQ_S.W.PH.
    The problem is that it does not use all the accumulators, so the compiler generates code to store and restore the accumulator.
    I am using the MIPS GCC compiler that comes with the Android build system. The DSP and optimization flags (-mdsp, -O3) are set.

    #49015

    yanan
    Member

    Shabeer,

    So what is the GCC version in your Android tools?

    Also, -O3 always does more aggressive optimization, and sometimes the result will surprise you.

    If you could show more code or a simple project, I think I could help.

    Regards,

    Yanan

    #49064

    shabeer
    Member

    Hi,
    I am using GCC version 4.6. The simple code given below can be taken as an example;
    the generated assembly follows it. I don't understand why the compiler generates so many mtlo, mthi, mflo and mfhi instructions inside the loop. They seem unnecessary, since we zero the accumulators before the loop. Why should the accumulators be moved out to general-purpose registers and back? They are only used to accumulate inside the loop.

    Thanks
    shabeer
    ###################################
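    /* a64 and v2q15 are the GCC MIPS DSP types (64-bit accumulator, vector of two Q15 values) */
    a64 res = 0;
    v2q15 v1[4] = {1,2,3,4,5,6,7,8};
    a64 accum0;                          /* plain global, not bound to a register */

    /* global register variables bound to DSP accumulators 1..3 */
    register a64 accum1 asm ("$ac1lo");
    register a64 accum2 asm ("$ac2lo");
    register a64 accum3 asm ("$ac3lo");

    void check_madd_xxx(int n)
    {
        accum0 = 0; accum1 = 0; accum2 = 0; accum3 = 0;

        /* one dot-product-accumulate per accumulator in each iteration */
        for (int i = 0; i < n; i++) {
            accum0 = __builtin_mips_dpaq_s_w_ph(accum0, v1[0], v1[0]);
            accum1 = __builtin_mips_dpaq_s_w_ph(accum1, v1[1], v1[1]);
            accum2 = __builtin_mips_dpaq_s_w_ph(accum2, v1[2], v1[2]);
            accum3 = __builtin_mips_dpaq_s_w_ph(accum3, v1[3], v1[3]);
        }

        res = accum0 + accum1 + accum2 + accum3;
    }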
    ####################################
    ##GENERATED ASSEMBLY
    00080600 <_Z14check_madd_xxxi>:
    80600: 3c1c0001 lui gp,0x1
    80604: 279c9a10 addiu gp,gp,-26096
    80608: 0399e021 addu gp,gp,t9
    8060c: 8f888018 lw t0,-32744(gp)
    80610: 00001021 move v0,zero
    80614: 00001821 move v1,zero
    80618: ad020000 sw v0,0(t0)
    8061c: ad030004 sw v1,4(t0)
    80620: 00000813 mtlo zero,$ac1
    80624: 00000811 mthi zero,$ac1
    80628: 00001013 mtlo zero,$ac2
    8062c: 00001011 mthi zero,$ac2
    80630: 00001813 mtlo zero,$ac3
    80634: 00001811 mthi zero,$ac3
    80638: 00003821 move a3,zero
    8063c: 1000001d b 806b4 <_Z14check_madd_xxxi+0xb4>
    80640: 8f85801c lw a1,-32740(gp)
    80644: 8d090004 lw t1,4(t0)
    80648: 00400013 mtlo v0
    8064c: 8ca60000 lw a2,0(a1)
    80650: 01200011 mthi t1
    80654: 0020c012 mflo t8,$ac1
    80658: 00207810 mfhi t7,$ac1
    8065c: 00406812 mflo t5,$ac2
    80660: 00406010 mfhi t4,$ac2
    80664: 00605012 mflo t2,$ac3
    80668: 00604810 mfhi t1,$ac3
    8066c: 7cc60130 dpaq_s.w.ph $ac0,a2,a2
    80670: 00001812 mflo v1
    80674: 0000c810 mfhi t9
    80678: 8cae0004 lw t6,4(a1)
    8067c: 03000813 mtlo t8,$ac1
    80680: 01e00811 mthi t7,$ac1
    80684: 8cab0008 lw t3,8(a1)
    80688: 01a01013 mtlo t5,$ac2
    8068c: 01801011 mthi t4,$ac2
    80690: 8ca6000c lw a2,12(a1)
    80694: 01401813 mtlo t2,$ac3
    80698: 01201811 mthi t1,$ac3
    8069c: 7dce0930 dpaq_s.w.ph $ac1,t6,t6
    806a0: ad030000 sw v1,0(t0)
    806a4: 7d6b1130 dpaq_s.w.ph $ac2,t3,t3
    806a8: ad190004 sw t9,4(t0)
    806ac: 7cc61930 dpaq_s.w.ph $ac3,a2,a2
    806b0: 24e70001 addiu a3,a3,1
    806b4: 00e4502a slt t2,a3,a0
    806b8: 5540ffe2 bnezl t2,80644 <_Z14check_madd_xxxi+0x44>
    806bc: 8d020000 lw v0,0(t0)
    806c0: 00205812 mflo t3,$ac1
    806c4: 8f8d8018 lw t5,-32744(gp)
    806c8: 00204010 mfhi t0,$ac1
    806cc: 00403012 mflo a2,$ac2
    806d0: 8dac0000 lw t4,0(t5)
    806d4: 00401810 mfhi v1,$ac2
    806d8: 8da50004 lw a1,4(t5)
    806dc: 00607812 mflo t7,$ac3
    806e0: 016c5021 addu t2,t3,t4
    806e4: 00606810 mfhi t5,$ac3
    806e8: 01053821 addu a3,t0,a1
    806ec: 014b102b sltu v0,t2,t3
    806f0: 01467021 addu t6,t2,a2
    806f4: 00474821 addu t1,v0,a3
    806f8: 01cac02b sltu t8,t6,t2
    806fc: 0123c821 addu t9,t1,v1
    80700: 8f848020 lw a0,-32736(gp)
    80704: 01cf2821 addu a1,t6,t7
    80708: 03196021 addu t4,t8,t9
    8070c: 00ae402b sltu t0,a1,t6
    80710: 018d5821 addu t3,t4,t5
    80714: 010b3821 addu a3,t0,t3
    80718: ac850000 sw a1,0(a0)
    8071c: 03e00008 jr ra
    80720: ac870004 sw a3,4(a0)

    #49186

    yanan
    Member

    Shabeer,

    Yes, I saw a similar problem. Even when I added “-ffixed-reg” options, the built-in functions still used other ac registers.

    But I don’t think this is the same as your issue. I previously wrote some complex code that uses many built-in DSP functions and checked the output line by line; I think it was efficient enough.

    Regards,

    Yanan

    #49226

    shabeer
    Member

    Thank you so much for your time!!

    I think the addresses 80644 to 806b8 are the loop body. I still don't understand why
    code is generated to move values out of an accumulator (e.g. mflo t8,$ac1) and then move them back into the same accumulator (i.e. mtlo t8,$ac1).

    I was expecting a loop with only load and MAC instructions. I think that would be more efficient.

    Please correct me if I am wrong. Is there a better way to tell the compiler what I want?

    Regards
    Shabeer Khan

    #49289

    Avantika
    Member

    Please try defining the accumulators as local variables, like this: a64 accum0, accum1, accum2, accum3;
    There should not be any other global 64-bit variables defined. Do not define more than four 64-bit variables (please check with three accumulators first).
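
    A minimal sketch of that suggestion, using three local accumulators and no global 64-bit variables. The function name dot_local_acc, the arrays x and c, and the MAKE_V2Q15 and DPAQ macros are hypothetical; the #else branch is only an assumed portable stand-in so the code also builds off-target. Whether GCC then keeps each local in its own $ac register depends on the compiler version, so the objdump output still needs checking.

```c
#include <stdint.h>

typedef long long a64;

#if defined(__mips_dsp)
/* MIPS DSP target (-mdsp): vector type and built-in as documented by GCC. */
typedef short v2q15 __attribute__((vector_size(4)));
#define MAKE_V2Q15(e0, e1) ((v2q15){(e0), (e1)})
#define DPAQ(acc, a, b) __builtin_mips_dpaq_s_w_ph((acc), (a), (b))
#else
/* Assumed portable stand-in for off-target builds (no DSPControl flags). */
typedef struct { int16_t e[2]; } v2q15;
#define MAKE_V2Q15(e0, e1) ((v2q15){{(e0), (e1)}})
static int32_t mul_q15(int16_t a, int16_t b)
{
    if (a == INT16_MIN && b == INT16_MIN)
        return INT32_MAX;               /* -1.0 * -1.0 saturates */
    return (int32_t)a * (int32_t)b * 2; /* Q15 x Q15 -> Q31 */
}
static a64 DPAQ(a64 acc, v2q15 a, v2q15 b)
{
    return acc + (a64)mul_q15(a.e[0], b.e[0]) + (a64)mul_q15(a.e[1], b.e[1]);
}
#endif

/* Hypothetical dot product over n packed Q15 pairs. */
static a64 dot_local_acc(const v2q15 *x, const v2q15 *c, int n)
{
    /* Locals only, three accumulators first, as suggested above. */
    a64 acc0 = 0, acc1 = 0, acc2 = 0;
    int i = 0;

    /* Unrolled by three: each accumulator has its own dependency chain. */
    for (; i + 3 <= n; i += 3) {
        acc0 = DPAQ(acc0, x[i],     c[i]);
        acc1 = DPAQ(acc1, x[i + 1], c[i + 1]);
        acc2 = DPAQ(acc2, x[i + 2], c[i + 2]);
    }
    /* Leftover elements when n is not a multiple of three. */
    for (; i < n; i++)
        acc0 = DPAQ(acc0, x[i], c[i]);

    /* Combine once, after the loop. */
    return acc0 + acc1 + acc2;
}
```

    The accumulators are summed only once, after the loop, so nothing inside the loop needs the 64-bit values in general-purpose registers.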
