
NAS performance with kernel 3.19.0

tkaiser  
Edited by tkaiser at Wed Feb 11, 2015 03:43

Based on Igor's image build scripts I started with kernel 3.19.0-rc5 and Debian Wheezy. The cpufreq framework doesn't work in 3.19 yet; a lot of the related work is pending and will only become available with kernel 3.20. Since I wanted to maximize performance I tried a few things, especially clocking the CPU cores as high as possible (the default CPU clock in mainline is just 912 MHz!).

From Igor's build directory I patched output/u-boot/include/configs/sun7i.h from 912 MHz to 1056 MHz (everything above that failed in my limited tests). Setting the CPU clock in u-boot is currently the only supported method:
#define CONFIG_CLK_FULL_SPEED                1056000000
I also modified the kernel config, since some settings are known to positively influence SMP performance on ARM (compare with http://forum.lemaker.org/thread-7102-1-1.html). In lib/config/linux-sunxi-next.config I set:
CONFIG_SCHED_MC=y
CONFIG_SCHED_SMT=y
(CONFIG_HZ_100=y and CONFIG_HZ=100 were already set before).

Isolated network tests / iperf throughput:

Tests without further tuning (only eth0 IRQs handled by CPU1 and slight 'overclocking' to 1056 MHz):
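Binding the NIC interrupts to one core can be done through procfs. A sketch (assuming the interface shows up as 'eth0' in /proc/interrupts; writing the mask needs root):

```shell
# Look up eth0's IRQ number in /proc/interrupts and bind it to CPU1
# (affinity bitmask 2 = second core). Does nothing if eth0 is absent
# or if we lack permission to write the affinity mask.
irq=$(awk '/eth0/ {sub(":", "", $1); print $1; exit}' /proc/interrupts)
if [ -n "$irq" ] && [ -w "/proc/irq/$irq/smp_affinity" ]; then
    echo 2 > "/proc/irq/$irq/smp_affinity"
fi
```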

OS X --> BPi:
root@bananapi:~# iperf -s
------------------------------------------------------------
Server listening on TCP port 5001
TCP window size: 85.3 KByte (default)
------------------------------------------------------------
[  4] local 192.168.83.44 port 5001 connected with 192.168.83.247 port 63446
[ ID] Interval       Transfer     Bandwidth
[  4]  0.0-10.0 sec  1.09 GBytes   934 Mbits/sec
[  5] local 192.168.83.44 port 5001 connected with 192.168.83.247 port 63447
[  5]  0.0-10.0 sec  1.10 GBytes   940 Mbits/sec
[  4] local 192.168.83.44 port 5001 connected with 192.168.83.247 port 63448
[  4]  0.0-10.0 sec  1.10 GBytes   940 Mbits/sec
[  5] local 192.168.83.44 port 5001 connected with 192.168.83.247 port 63449
[  5]  0.0-10.0 sec  1.10 GBytes   941 Mbits/sec
[  4] local 192.168.83.44 port 5001 connected with 192.168.83.247 port 63450
[  4]  0.0-10.0 sec  1.09 GBytes   939 Mbits/sec
BPi --> OS X:
mountain-mini:~ admin$ iperf -s
------------------------------------------------------------
Server listening on TCP port 5001
TCP window size:  128 KByte (default)
------------------------------------------------------------
[  4] local 192.168.83.247 port 5001 connected with 192.168.83.44 port 45648
[ ID] Interval       Transfer     Bandwidth
[  4]  0.0-10.0 sec   689 MBytes   577 Mbits/sec
[  4] local 192.168.83.247 port 5001 connected with 192.168.83.44 port 45649
[  4]  0.0-10.0 sec   800 MBytes   671 Mbits/sec
[  4] local 192.168.83.247 port 5001 connected with 192.168.83.44 port 45650
[  4]  0.0-10.0 sec   803 MBytes   673 Mbits/sec
[  4] local 192.168.83.247 port 5001 connected with 192.168.83.44 port 45651
[  4]  0.0-10.0 sec   806 MBytes   676 Mbits/sec
[  4] local 192.168.83.247 port 5001 connected with 192.168.83.44 port 45652
[  4]  0.0-10.0 sec   798 MBytes   669 Mbits/sec
SATA / iozone results:

In 3.19 it's no longer necessary to patch sunxi_ahci to work either with or without a port multiplier.

EDIT: That's not true. I had to patch the kernel (unset AHCI_HFLAG_NO_PMP) and unfortunately my JMB321 doesn't seem to like my SSD that much. Regardless of which PM port I connect the SSD to, the link always negotiates just SATA 1.0 (1.5 Gbps). That's also the explanation for the decreased read throughput.
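The patch boils down to dropping the AHCI_HFLAG_NO_PMP flag from the sunxi AHCI platform driver's port info. A sketch against drivers/ata/ahci_sunxi.c (the exact context lines may differ between kernel versions):

```diff
--- a/drivers/ata/ahci_sunxi.c
+++ b/drivers/ata/ahci_sunxi.c
@@ static const struct ata_port_info ahci_sunxi_port_info = {
-	AHCI_HFLAGS(AHCI_HFLAG_32BIT_ONLY | AHCI_HFLAG_NO_MSI |
-		    AHCI_HFLAG_NO_PMP | AHCI_HFLAG_YES_NCQ),
+	AHCI_HFLAGS(AHCI_HFLAG_32BIT_ONLY | AHCI_HFLAG_NO_MSI |
+		    AHCI_HFLAG_YES_NCQ),
```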

I tested with a JMB321 between the Banana Pi and the EVO 840:
     KB  reclen   write rewrite    read    reread
2048000       4   42281   42694   137476   137470
2048000      32   42025   42261   137427   137461
2048000     512   41590   41938   135917   136992
2048000   16384   41495   42008   136863   136903
It makes no difference whether CONFIG_SCHED_MC=y and CONFIG_SCHED_SMT=y are set or not. The situation is different when there's no PM as a bottleneck:

CONFIG_SCHED_MC and CONFIG_SCHED_SMT not set:
     KB  reclen   write rewrite    read    reread
2048000       4   43053   43643   202636   203209
2048000      32   42743   42949   184831   183612
2048000     512   41774   42533   174186   175606
2048000   16384   41897   42500   175786   172778
CONFIG_SCHED_MC=y and CONFIG_SCHED_SMT=y:
     KB  reclen   write rewrite    read    reread
2048000       4   44413   44092   212359   212756
2048000      32   44078   44115   191354   192068
2048000     512   44167   43707   184329   184778
2048000   16384   43238   43798   185875   186162
Combined tests / LanTest results:



The tweaked kernel settings seem to provide another few MB/s due to higher SATA throughput. Interestingly, the situation changes when one uses the '10 GBit Ethernet' settings (larger test files and 1024K blocksize instead of 128K as before):



Looks promising, especially when you keep in mind that it will be possible to clock the BPi at 1200 MHz again sooner or later (and then throughput will increase a little bit again).
tkaiser  
Some final words regarding this test: I removed the PM again, installed Igor's image on the SSD (/root/nand-sata-install.sh) and applied some network tuning:
sysctl -w net/core/rmem_max=8738000
sysctl -w net/core/wmem_max=6553600
sysctl -w net/ipv4/tcp_rmem="8192 873800 8738000"
sysctl -w net/ipv4/tcp_wmem="4096 655360 6553600"
sysctl -w vm/min_free_kbytes=65536
sysctl -w net.ipv4.tcp_window_scaling=1
sysctl -w net.ipv4.tcp_timestamps=1
sysctl -w net.ipv4.tcp_sack=1
sysctl -w net.ipv4.tcp_no_metrics_save=1
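These sysctl settings are volatile; to make them survive a reboot you could put them into /etc/sysctl.conf (or a file under /etc/sysctl.d/). A sketch with the same values in dotted notation:

```
net.core.rmem_max = 8738000
net.core.wmem_max = 6553600
net.ipv4.tcp_rmem = 8192 873800 8738000
net.ipv4.tcp_wmem = 4096 655360 6553600
vm.min_free_kbytes = 65536
net.ipv4.tcp_window_scaling = 1
net.ipv4.tcp_timestamps = 1
net.ipv4.tcp_sack = 1
net.ipv4.tcp_no_metrics_save = 1
```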
And I assigned the 3 processes serving my Mac (cnid_metad, cnid_dbd and the afpd serving my session) to CPU 0 (if you use Windows/Samba this would apply to the smbd process serving your session instead):
for i in 2488 2494 2495 ; do taskset -p 01 $i ; done
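The three PIDs above are of course specific to my running system. A more generic sketch (assuming netatalk's usual process names) could look them up with pgrep instead:

```shell
# Pin all netatalk daemons (afpd, cnid_metad, cnid_dbd) to CPU 0
# (affinity bitmask 01). pgrep -f matches against the full command
# line; the loop simply does nothing when no daemon is running.
pids=$(pgrep -f 'afpd|cnid_metad|cnid_dbd' || true)
for pid in $pids; do
    taskset -p 01 "$pid"
done
```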
Now it looks like this (still with just 1056 MHz):



And with 3000M filesize and 1024K blocksize:



The measured values stayed pretty stable during all 5 test runs. Some explanations of the LanTest parameters as well as their relation to 'real world' scenarios can be found here: http://www.helios.de/viewart-de.html?id=1711

tkaiser  
Edited by tkaiser at Mon Feb 9, 2015 05:52

I made another test with Igor's build scripts, 3.19.0-rc6, different kernel sources (using Maxime Ripard's sunxi-next branch instead of 'official' mainline), a slightly different kernel config and a different u-boot (mainline instead of Robert C. Nelson's), but still 1056 MHz instead of the default 912 MHz.

SATA throughput measured with iozone is a bit slower and this also applies to network throughput measured with iperf after the usual TCP/IP tuning outlined above:



The LanTest results look OK, but the orange triangles indicate that something is wrong (sequential writes between 24.4 MB/s min. and 38.2 MB/s max. --> too much variation). But it's too early to do serious benchmarking because I changed another thing too: the board itself.



(It's a LinkSprite pcDuino3 Nano, comparable to the original Banana Pi. I applied Adam Sampson's patches to u-boot and the kernel sources and modified Igor's scripts a bit. Two problems remain: a currently broken mainline u-boot that is not able to read/parse boot.scr, and maybe the DRAM clock frequency, which Adam set to 408 MHz -- I will try 480 MHz later.)

BTW: The pcDuino's SATA power connector is compatible with the Cubieboard's but not the Banana Pi's. So SATA/power cables made for the pcDuino or Cubieboards must not be used with the Banana/Orange Pi and vice versa!

tkaiser  
Hmm... I increased the DRAM clock to 480 MHz: http://pastebin.com/kCusGxQm

While it seems to be stable, unfortunately the network throughput doesn't increase (compared to the BPi it's way slower). I've done some standard iperf tests:

OS X --> pcDuino
[  4]  0.0-10.0 sec   570 MBytes   478 Mbits/sec
[  4] local 192.168.83.74 port 5001 connected with 192.168.83.73 port 45115
[  4]  0.0-10.0 sec   594 MBytes   498 Mbits/sec
[  4] local 192.168.83.74 port 5001 connected with 192.168.83.73 port 45116
[  4]  0.0-10.0 sec   840 MBytes   704 Mbits/sec
[  4] local 192.168.83.74 port 5001 connected with 192.168.83.73 port 45117
[  4]  0.0-10.0 sec   611 MBytes   512 Mbits/sec
[  4] local 192.168.83.74 port 5001 connected with 192.168.83.73 port 45118
[  4]  0.0-10.0 sec   604 MBytes   506 Mbits/sec
[  4] local 192.168.83.74 port 5001 connected with 192.168.83.73 port 45119
[  4]  0.0-10.0 sec   716 MBytes   600 Mbits/sec
pcDuino --> OS X
[  4]  0.0-10.0 sec   763 MBytes   640 Mbits/sec
[  5] local 192.168.83.73 port 5001 connected with 192.168.83.74 port 49502
[  5]  0.0-10.0 sec   389 MBytes   326 Mbits/sec
[  4] local 192.168.83.73 port 5001 connected with 192.168.83.74 port 49503
[  4]  0.0-10.0 sec   400 MBytes   336 Mbits/sec
[  5] local 192.168.83.73 port 5001 connected with 192.168.83.74 port 49505
[  5]  0.0-10.0 sec   397 MBytes   333 Mbits/sec
[  4] local 192.168.83.73 port 5001 connected with 192.168.83.74 port 49507
[  4]  0.0-10.0 sec   834 MBytes   699 Mbits/sec
[  5] local 192.168.83.73 port 5001 connected with 192.168.83.74 port 49509
[  5]  0.0-10.0 sec   788 MBytes   661 Mbits/sec
And I ran the iozone tests twice each on the SATA SSD (using "iozone -a -g 2000m -s 2000m -i 0 -i 1 -r${recordsize}k" with different record sizes):
     KB  reclen   write rewrite    read    reread
2048000       4   41588   47520   218494   219156
2048000       4   41247   48200   218171   219174
2048000      32   40930   48094   196609   197547
2048000      32   41417   47526   198031   198522
2048000     512   40384   47232   192915   192401
2048000     512   42114   46416   192788   193241
2048000   16384   46025   44079   195094   195918
2048000   16384   43941   46504   222510   223501
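The per-record-size runs can be wrapped in a small loop. A sketch (iozone must be installed and the current directory must live on the SSD under test; the loop is skipped when iozone is absent):

```shell
# Write/rewrite and read/reread tests (-i 0 -i 1) with a 2000 MB file
# and the four record sizes used in the listings above.
if command -v iozone >/dev/null; then
    for recordsize in 4 32 512 16384; do
        iozone -a -g 2000m -s 2000m -i 0 -i 1 -r${recordsize}k
    done
fi
```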
LanTest results look good except for write performance. Here the network throughput seems to be the bottleneck:

hmmm nice results

tkaiser  
rutierut replied at Mon Feb 9, 2015 05:12
hmmm nice results

Unfortunately they're partly wrong, or better said: not directly comparable. I made another set of iperf tests with the same Mac I used before against the Banana Pi and was also able to get ~940 Mbits/sec in one direction and above 750 Mbits/sec in the other. So I have to further investigate what's wrong here, since I always get slower results with a machine connected directly to the same switch (HP 1810-8G v2) as the BPi/pcDuino, compared to the superior results another machine achieves that is three hops (switches) away.

Another interesting thing was measuring the influence of the DRAM clock. I tested the pcDuino with both a 408 MHz and a 480 MHz DRAM clock, always running each test twice using "iozone -a -g 2000m -s 2000m -i 0 -i 1 -r${recordsize}k":

With 480 MHz DRAM and 1056 MHz cpufreq:
     KB  reclen   write rewrite    read    reread
2048000       4   41588   47520   218494   219156
2048000       4   41247   48200   218171   219174
2048000      32   40930   48094   196609   197547
2048000      32   41417   47526   198031   198522
2048000     512   40384   47232   192915   192401
2048000     512   42114   46416   192788   193241
2048000   16384   46025   44079   195094   195918
2048000   16384   43941   46504   222510   223501
With just 408 MHz DRAM and also 1056 MHz cpufreq:
     KB  reclen   write rewrite    read    reread
2048000       4   37368   43087   206034   208048
2048000       4   41708   43635   207130   207796
2048000      32   42540   42384   180647   179998
2048000      32   40230   42918   183590   184689
2048000     512   42601   42131   175569   175094
2048000     512   40822   43882   177408   176607
2048000   16384   38623   43127   203863   203676
2048000   16384   43976   44185   178184   178290
But since figuring out at which frequency the DRAM still works reliably requires a lot of effort, I would stay with the manufacturer's defaults (408 MHz on the pcDuino3 Nano, 432 MHz on Banana Pi/Pro).
