CI20's FPU performance issue


This topic contains 6 replies, has 2 voices, and was last updated by Martin Krastev 2 years, 1 month ago.

Viewing 7 posts - 1 through 7 (of 7 total)
  • #49748

    Hello,

    Is there a known issue with Ingenic’s FPU that could make it underperform its expected level by about an order of magnitude? Something that disables pipelining in the FPU, perhaps?

    I’m running a very rudimentary 4×4 matrix multiplication test here, using only madd.s and mul.s for the arithmetic, and I’m observing ~1/10 of the performance one would expect.

    Here’s the disassembly:


    402594: 00923021 addu a2,a0,s2
    402598: c4c00058 lwc1 $f0,88(a2)
    40259c: c4c10024 lwc1 $f1,36(a2)
    4025a0: c4c20050 lwc1 $f2,80(a2)
    4025a4: c4c30034 lwc1 $f3,52(a2)
    4025a8: c4c50004 lwc1 $f5,4(a2)
    4025ac: c4c40054 lwc1 $f4,84(a2)
    4025b0: c4c6005c lwc1 $f6,92(a2)
    4025b4: c4c70014 lwc1 $f7,20(a2)
    4025b8: 46042a02 mul.s $f8,$f5,$f4
    4025bc: 46000a82 mul.s $f10,$f1,$f0
    4025c0: 46040ac2 mul.s $f11,$f1,$f4
    4025c4: 46020b02 mul.s $f12,$f1,$f2
    4025c8: 46002c02 mul.s $f16,$f5,$f0
    4025cc: 46022c42 mul.s $f17,$f5,$f2
    4025d0: 46061c82 mul.s $f18,$f3,$f6
    4025d4: 46062942 mul.s $f5,$f5,$f6
    4025d8: 46060842 mul.s $f1,$f1,$f6
    4025dc: 46063982 mul.s $f6,$f7,$f6
    4025e0: c4ce0040 lwc1 $f14,64(a2)
    4025e4: c4d50020 lwc1 $f21,32(a2)
    4025e8: 46021a42 mul.s $f9,$f3,$f2
    4025ec: 46001cc2 mul.s $f19,$f3,$f0
    4025f0: e7a6001c swc1 $f6,28(sp)
    4025f4: 4d8ea9a0 madd.s $f6,$f12,$f21,$f14
    4025f8: 46023882 mul.s $f2,$f7,$f2
    4025fc: 46003802 mul.s $f0,$f7,$f0
    402600: 46043bc2 mul.s $f15,$f7,$f4
    402604: 460418c2 mul.s $f3,$f3,$f4
    402608: c4cd0044 lwc1 $f13,68(a2)
    40260c: c4c40030 lwc1 $f4,48(a2)
    402610: c4d40048 lwc1 $f20,72(a2)
    402614: 4c920600 lwxc1 $f24,s2(a0)
    402618: c4d60010 lwc1 $f22,16(a2)
    40261c: c4d7004c lwc1 $f23,76(a2)
    402620: c7ac001c lwc1 $f12,28(sp)
    402624: e7a60018 swc1 $f6,24(sp)
    402628: 4cb7c1a0 madd.s $f6,$f5,$f24,$f23
    40262c: 4c4eb160 madd.s $f5,$f2,$f22,$f14
    402630: 4d2e2260 madd.s $f9,$f9,$f4,$f14
    402634: 4c37a8a0 madd.s $f2,$f1,$f21,$f23
    402638: 4e2ec3a0 madd.s $f14,$f17,$f24,$f14
    40263c: 4c14b060 madd.s $f1,$f0,$f22,$f20
    402640: 4d54aaa0 madd.s $f10,$f10,$f21,$f20
    402644: 4e7424e0 madd.s $f19,$f19,$f4,$f20
    402648: 4e14c420 madd.s $f16,$f16,$f24,$f20
    40264c: 4d0dc220 madd.s $f8,$f8,$f24,$f13
    402650: 4d6daae0 madd.s $f11,$f11,$f21,$f13
    402654: 4c6d20e0 madd.s $f3,$f3,$f4,$f13
    402658: 4dedb360 madd.s $f13,$f15,$f22,$f13
    40265c: 4e572120 madd.s $f4,$f18,$f4,$f23
    402660: 4d97b5a0 madd.s $f22,$f12,$f22,$f23
    402664: c4c70060 lwc1 $f7,96(a2)
    402668: c4d90008 lwc1 $f25,8(a2)
    40266c: c4db0068 lwc1 $f27,104(a2)
    402670: c4df0018 lwc1 $f31,24(a2)
    402674: c4da0064 lwc1 $f26,100(a2)
    402678: c4dc0028 lwc1 $f28,40(a2)
    40267c: c4dd0038 lwc1 $f29,56(a2)
    402680: c7ac0018 lwc1 $f12,24(sp)
    402684: c4de006c lwc1 $f30,108(a2)
    402688: 4c3bf860 madd.s $f1,$f1,$f31,$f27
    40268c: 4e1bcc20 madd.s $f16,$f16,$f25,$f27
    402690: 4dc7cba0 madd.s $f14,$f14,$f25,$f7
    402694: 4d27ea60 madd.s $f9,$f9,$f29,$f7
    402698: 4d5be2a0 madd.s $f10,$f10,$f28,$f27
    40269c: 4d87e320 madd.s $f12,$f12,$f28,$f7
    4026a0: 4ca7f960 madd.s $f5,$f5,$f31,$f7
    4026a4: 4d1ac9e0 madd.s $f7,$f8,$f25,$f26
    4026a8: 4e7bea20 madd.s $f8,$f19,$f29,$f27
    4026ac: 4edefda0 madd.s $f22,$f22,$f31,$f30
    4026b0: 4dbafb60 madd.s $f13,$f13,$f31,$f26
    4026b4: 4c9ee920 madd.s $f4,$f4,$f29,$f30
    4026b8: 4c7ae8e0 madd.s $f3,$f3,$f29,$f26
    4026bc: 4c5ee0a0 madd.s $f2,$f2,$f28,$f30
    4026c0: 4d7ae2e0 madd.s $f11,$f11,$f28,$f26
    4026c4: 4cdec9a0 madd.s $f6,$f6,$f25,$f30
    4026c8: c4d50070 lwc1 $f21,112(a2)
    4026cc: c4d8002c lwc1 $f24,44(a2)
    4026d0: c4cf001c lwc1 $f15,28(a2)
    4026d4: c4c00078 lwc1 $f0,120(a2)
    4026d8: c4d7003c lwc1 $f23,60(a2)
    4026dc: c4d4000c lwc1 $f20,12(a2)
    4026e0: c4d10074 lwc1 $f17,116(a2)
    4026e4: c4d2007c lwc1 $f18,124(a2)
    4026e8: 4d00ba20 madd.s $f8,$f8,$f23,$f0
    4026ec: 4e00a420 madd.s $f16,$f16,$f20,$f0
    4026f0: 4c207860 madd.s $f1,$f1,$f15,$f0
    4026f4: 4d40c020 madd.s $f0,$f10,$f24,$f0
    4026f8: 4dd5a2a0 madd.s $f10,$f14,$f20,$f21
    4026fc: 4cf1a1e0 madd.s $f7,$f7,$f20,$f17
    402700: 4cd2a1a0 madd.s $f6,$f6,$f20,$f18
    402704: 4cb57960 madd.s $f5,$f5,$f15,$f21
    402708: 4db17b60 madd.s $f13,$f13,$f15,$f17
    40270c: 4ed27be0 madd.s $f15,$f22,$f15,$f18
    402710: 4d95c320 madd.s $f12,$f12,$f24,$f21
    402714: 4d71c2e0 madd.s $f11,$f11,$f24,$f17
    402718: 4c52c0a0 madd.s $f2,$f2,$f24,$f18
    40271c: 4d35ba60 madd.s $f9,$f9,$f23,$f21
    402720: 4c71b8e0 madd.s $f3,$f3,$f23,$f17
    402724: 4c92b920 madd.s $f4,$f4,$f23,$f18
    402728: 02430821 addu at,s2,v1
    40272c: 2442ffff addiu v0,v0,-1
    402730: 00b23821 addu a3,a1,s2
    402734: 4cb25008 swxc1 $f10,s2(a1)
    402738: 00209021 move s2,at
    40273c: e4e4003c swc1 $f4,60(a3)
    402740: e4e80038 swc1 $f8,56(a3)
    402744: e4e30034 swc1 $f3,52(a3)
    402748: e4e90030 swc1 $f9,48(a3)
    40274c: e4e2002c swc1 $f2,44(a3)
    402750: e4e00028 swc1 $f0,40(a3)
    402754: e4eb0024 swc1 $f11,36(a3)
    402758: e4ec0020 swc1 $f12,32(a3)
    40275c: e4ef001c swc1 $f15,28(a3)
    402760: e4e10018 swc1 $f1,24(a3)
    402764: e4ed0014 swc1 $f13,20(a3)
    402768: e4e50010 swc1 $f5,16(a3)
    40276c: e4e6000c swc1 $f6,12(a3)
    402770: e4f00008 swc1 $f16,8(a3)
    402774: 1440ff87 bnez v0,402594 <main+0x22c>
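
For reference, the loop above contains 16 mul.s and 48 madd.s instructions, which is what a straightforward 4×4 product needs: 64 multiplies and 48 adds, or 112 flops per matrix. A minimal C sketch of such a kernel follows. The names `mat4`, `mat4_mul` and `MAT4_MUL_FLOPS`, as well as the row-major layout, are assumptions made for illustration and are not taken from the original test.

```c
/* Hypothetical 4x4 single-precision matrix; row-major layout is an assumption. */
typedef struct {
    float m[4][4];
} mat4;

/* Flops per 4x4 product: 16 mul (1 flop each) + 48 multiply-add (2 flops each). */
enum { MAT4_MUL_FLOPS = 16 * 1 + 48 * 2 };

/* c = a * b. The output must not alias either input. */
void mat4_mul(mat4 *c, const mat4 *a, const mat4 *b)
{
    for (int i = 0; i < 4; ++i) {
        for (int j = 0; j < 4; ++j) {
            /* First term: a plain multiply (a mul.s candidate). */
            float acc = a->m[i][0] * b->m[0][j];
            /* Remaining three terms: multiply-adds (madd.s candidates,
               if the compiler chooses to emit them). */
            for (int k = 1; k < 4; ++k)
                acc += a->m[i][k] * b->m[k][j];
            c->m[i][j] = acc;
        }
    }
}
```

Multiplying `MAT4_MUL_FLOPS` by the number of products and dividing by the elapsed time gives the achieved flop rate, which can then be compared against the expected one.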

    #49757

    I realize this subject requires some elaboration.

    My expectations of the ci20’s baseline FPU performance are based on the assumption that the FPU is actually pipelined. I say ‘assumption’ because, although I seem to remember reading that the FPU is pipelined, I haven’t been able to trace that information back to its source. So I might be misremembering things, or attributing information about one CPU to another. After all, the rudimentary test from the original post of this thread does not suggest that the ci20 has a pipelined FPU.

    So here is the one question I should have asked at the very start:

    Is JZ4780’s FPU pipelined?
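
One common way to probe this question empirically is to time the same number of multiply-adds arranged first as a single dependency chain and then as several independent chains. The sketch below is only an illustration under stated assumptions: the function names are hypothetical, `clock()` is a coarse timer, and the code should be built with optimization but without `-ffast-math`, so that the compiler does not reassociate the chains.

```c
#include <stddef.h>
#include <time.h>

/* One long dependency chain: every multiply-add waits for the previous result. */
float chain_dependent(float x, float y, size_t n)
{
    float acc = 0.0f;
    for (size_t i = 0; i < n; ++i)
        acc = acc * x + y;
    return acc;
}

/* The same total number of multiply-adds, split over four independent chains.
   n is assumed to be a multiple of 4. */
float chain_independent(float x, float y, size_t n)
{
    float a0 = 0.0f, a1 = 0.0f, a2 = 0.0f, a3 = 0.0f;
    for (size_t i = 0; i < n; i += 4) {
        a0 = a0 * x + y;
        a1 = a1 * x + y;
        a2 = a2 * x + y;
        a3 = a3 * x + y;
    }
    return a0 + a1 + a2 + a3;
}

/* CPU seconds spent in fn; the result goes to *out so the call is not optimized away. */
double time_chain(float (*fn)(float, float, size_t),
                  float x, float y, size_t n, float *out)
{
    clock_t t0 = clock();
    *out = fn(x, y, n);
    return (double)(clock() - t0) / CLOCKS_PER_SEC;
}
```

On a pipelined FPU the independent version would be expected to finish noticeably faster for a large `n`, since a new operation can start before the previous one completes. Roughly equal times for both versions would be consistent with an FPU that is not pipelined.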

    #49803

    ZubairLK
    Member

    Hi Martin,

    Welcome to the forum.

    Ingenic’s manuals are a bit sparse about details on the FPU.

    I’m afraid we also don’t have an idea of what kind of performance to expect.

    We did run a short test to see if the FPU instructions are being emulated by the cpu because of some configuration issue. But the kernel wasn’t reporting that it was emulating instructions.

    Regards,
    ZubairLK

    #49869

    Hi Zubair,

    Thank you for the welcome. I’ve been enjoying the ci20 and its fairly mature GL ES stack. But I think it would be nice if the basic performance characteristics of the FPU were known in advance: setting realistic expectations is always a good thing. Along those lines, is there a way we could ask Ingenic about the state of FPU pipelining in the JZ4780?

    Regards,
    Martin

    #49968

    ZubairLK
    Member

    Hi,

    We’ve asked Ingenic for some documentation about the FPU. Let’s see…

    Regards,
    ZubairLK

    #49988

    ZubairLK
    Member

    Hi Martin,

    Ingenic replied saying that the FPU in JZ4780 is ‘not’ pipelined…

    Hope that clears the confusion.

    Regards,
    ZubairLK

    #49991

    Hi Zubair,

    Thank you for the follow-up on the matter. It does clear the confusion.

    Best regards,
    Martin

The forum ‘Creator Platforms’ is closed to new topics and replies.