
First impressions with cpufreq stuff in Kernel 4.x

tkaiser  
Edited by tkaiser at Thu Jun 4, 2015 03:24

Using Armbian (Igor's image) I started experimenting with the cpufreq stuff over the last few days.

Test environment: Banana Pi, Kernel 4.0.4, u-boot 2015.4, CPU governor "performance", DRAM overclocked to 480 MHz.

I added additional operating points to arch/arm/boot/dts/sun7i-a20.dtsi in kernel source:
                            operating-points = <
                                    /* kHz    uV */
                                    1200000 1500000
                                    1152000 1500000
                                    1104000 1450000
                                    1056000 1450000
                                    1008000 1450000
                                    960000  1400000
                                    912000  1400000
                                    864000  1300000
                                    720000  1200000
                                    528000  1100000
                                    312000  1000000
                                    144000  900000
                                    >;
                            #cooling-cells = <2>;
                            cooling-min-level = <0>;
                            cooling-max-level = <6>;

Now it looks like this:
    cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_frequencies
    144000 312000 528000 720000 864000 912000 960000 1008000 1056000 1104000 1152000 1200000

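
Before pinning the maximum to one of the new clock speeds it can help to confirm that the kernel actually exposes it. A minimal sketch, assuming the sysfs path shown above; `has_freq` is a hypothetical helper name, and its optional second argument only exists so the check can be run against a saved copy of the file:

```shell
#!/bin/sh
# has_freq KHZ [FILE]: succeed if KHZ is listed in scaling_available_frequencies.
# FILE defaults to the sysfs path for cpu0 (assumption: same path as above).
has_freq() {
    khz=$1
    file=${2:-/sys/devices/system/cpu/cpu0/cpufreq/scaling_available_frequencies}
    # The file holds one space-separated line; split it into one value per
    # line and look for an exact match.
    tr -s ' ' '\n' < "$file" | grep -qx "$khz"
}
```

Usage would look like `has_freq 1152000 && echo "1152 MHz is available"`.
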
The disk used is a Samsung 840EVO with 128GB and connected directly to the SATA port. Mounted with btrfs (to detect data corruption easily):
    mount -v -t btrfs -o noatime /dev/sda /sata

I also tried the combination of 1200 MHz CPU and 480 MHz DRAM, but that didn't work reliably (defaults with kernel 4.x and u-boot 2015.4: 960 MHz CPU and 432 MHz DRAM).

Tests:

  • sysbench --test=cpu --cpu-max-prime=5000 run --num-threads=2
  • 7zr b
  • iozone -a -g 2000m -s 2000m -i 0 -i 1 -r${recordsize}k
  • iperf -c / iperf -s


CPU/memory results:

CPU 960 MHz:
    sysbench: 53.8095/0.00
    7-zip:

    Dict        Compressing          |        Decompressing
          Speed Usage    R/U Rating  |    Speed Usage    R/U Rating
           KB/s     %   MIPS   MIPS  |     KB/s     %   MIPS   MIPS

    22:     558   149    364    543  |    14209   200    642   1282
    23:     549   151    370    560  |    14016   200    642   1283
    24:     546   154    380    587  |    13816   200    641   1282
    25:     537   157    391    613  |    13609   200    640   1280
    ----------------------------------------------------------------
    Avr:          153    376    576               200    641   1282
    Tot:          176    509    929

CPU 1056 MHz:
    sysbench: 48.9071/0.00
    7-zip:

    Dict        Compressing          |        Decompressing
          Speed Usage    R/U Rating  |    Speed Usage    R/U Rating
           KB/s     %   MIPS   MIPS  |     KB/s     %   MIPS   MIPS

    22:     601   150    389    585  |    15558   200    703   1404
    23:     592   153    395    603  |    15351   200    703   1405
    24:     586   156    404    630  |    15108   200    702   1402
    25:     580   160    415    662  |    14844   200    699   1396
    ----------------------------------------------------------------
    Avr:          155    401    620               200    702   1402
    Tot:          177    551   1011

CPU 1104 MHz:
    sysbench: 46.7909/0.01
    7-zip:

    Dict        Compressing          |        Decompressing
          Speed Usage    R/U Rating  |    Speed Usage    R/U Rating
           KB/s     %   MIPS   MIPS  |     KB/s     %   MIPS   MIPS

    22:     625   151    402    608  |    16255   200    734   1467
    23:     614   153    408    625  |    15997   200    733   1464
    24:     610   157    417    656  |    15773   200    732   1463
    25:     604   161    429    690  |    15523   200    730   1460
    ----------------------------------------------------------------
    Avr:          156    414    645               200    732   1464
    Tot:          178    573   1054

CPU 1152 MHz:
    sysbench: 44.8551/0.01
    7-zip:

    Dict        Compressing          |        Decompressing
          Speed Usage    R/U Rating  |    Speed Usage    R/U Rating
           KB/s     %   MIPS   MIPS  |     KB/s     %   MIPS   MIPS

    22:     645   152    412    627  |    16931   200    765   1528
    23:     636   155    419    648  |    16675   200    764   1526
    24:     629   158    427    676  |    16457   200    763   1527
    25:     623   162    439    712  |    16147   200    760   1518
    ----------------------------------------------------------------
    Avr:          157    424    666               200    763   1525
    Tot:          178    594   1095

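
One way to read the sysbench figures: if the benchmark scales linearly with the CPU clock, run time multiplied by clock speed should stay constant. A quick check with awk, assuming the first number in each `sysbench:` line is the total run time in seconds (the post does not say so explicitly); `scaling_product` is a hypothetical helper name:

```shell
#!/bin/sh
# scaling_product: read "<MHz> <seconds>" per line on stdin and print the
# product of both values. A constant product means linear scaling.
scaling_product() {
    awk '{ printf "%d MHz: %.4f s -> %.0f\n", $1, $2, $1 * $2 }'
}

# Values taken from the four sysbench results above.
scaling_product <<'EOF'
960 53.8095
1056 48.9071
1104 46.7909
1152 44.8551
EOF
```

The four products differ by well under 0.1 %, so within this range the sysbench CPU test appears to scale almost perfectly with the clock.
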
All following tests with CPU clock set to 1152 MHz, DRAM still at 480 MHz:

I/O measured with iozone (file size and record length in KB, throughput in KB/s):
                  KB  reclen   write rewrite    read    reread
             2048000       1   33771   12970   171630   171679
             2048000       2   40478   14419   190772   189638
             2048000       4   47518   46912   173765   174739
             2048000      32   47427   47368   168773   168477
             2048000     512   46646   46507   164693   164179
             2048000   16384   46700   46400   168031   168297

Network measured with iperf (10 second runs):

RX (client to Banana Pi):
    [  4]  0.0-10.0 sec  1.09 GBytes   937 Mbits/sec
    [  5]  0.0-10.0 sec  1.09 GBytes   940 Mbits/sec
    [  4]  0.0-10.0 sec  1.09 GBytes   940 Mbits/sec
    [  5]  0.0-10.0 sec  1.10 GBytes   941 Mbits/sec
    [  4]  0.0-10.0 sec  1.09 GBytes   935 Mbits/sec
    [  5]  0.0-10.0 sec  1.10 GBytes   940 Mbits/sec
    [  4]  0.0-10.0 sec  1.10 GBytes   941 Mbits/sec
    [  5]  0.0-10.0 sec  1.09 GBytes   940 Mbits/sec
    [  4]  0.0-10.0 sec  1.10 GBytes   940 Mbits/sec
    [  5]  0.0-10.0 sec  1.10 GBytes   941 Mbits/sec

TX (Banana Pi to client):
    [  4]  0.0-10.0 sec   673 MBytes   564 Mbits/sec
    [  4]  0.0-10.0 sec   678 MBytes   568 Mbits/sec
    [  4]  0.0-10.0 sec   804 MBytes   674 Mbits/sec
    [  4]  0.0-10.0 sec   878 MBytes   737 Mbits/sec
    [  4]  0.0-10.0 sec   819 MBytes   687 Mbits/sec
    [  4]  0.0-10.0 sec   974 MBytes   817 Mbits/sec
    [  4]  0.0-10.0 sec   771 MBytes   646 Mbits/sec
    [  4]  0.0-10.0 sec   691 MBytes   579 Mbits/sec
    [  4]  0.0-10.0 sec   922 MBytes   773 Mbits/sec
    [  4]  0.0-10.0 sec   883 MBytes   740 Mbits/sec

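
To compare runs like these without eyeballing them, the summary lines can be reduced to minimum, mean and maximum. A small sketch; `iperf_stats` is a hypothetical helper and assumes every result is reported in Mbits/sec, as in the runs above:

```shell
#!/bin/sh
# iperf_stats: read iperf summary lines on stdin and print the number of
# runs plus min/mean/max throughput. Only values directly followed by the
# unit "Mbits/sec" are counted.
iperf_stats() {
    awk '
        {
            for (i = 2; i <= NF; i++) {
                if ($i == "Mbits/sec") {
                    v = $(i - 1) + 0
                    if (n == 0 || v < min) min = v
                    if (n == 0 || v > max) max = v
                    sum += v
                    n++
                }
            }
        }
        END {
            if (n > 0)
                printf "runs=%d min=%d mean=%.1f max=%d Mbits/sec\n", n, min, sum / n, max
        }
    '
}
```

Fed with the ten TX lines above it reports a minimum of 564, a mean of 678.5 and a maximum of 817 Mbits/sec, a spread of roughly 250 Mbits/sec compared with only 6 Mbits/sec (935 to 941) in the RX direction.
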
Don't try this at home unless you have really taken care of cooling. Nearly all available enclosures for the Banana Pi/Pro are simply crap due to bad thermal design (DRAM, SoC and PMU sit on the lower side of the PCB). Overclocking will only work if you take special precautions (mounting heatsinks, operating the board vertically with enough airflow).

And you shouldn't use the performance governor for normal use. It is necessary for benchmarking (since you will get random results otherwise), but normally one should use the ondemand governor with 4.x so that the clock speed increases only when really needed. Put something like this into /etc/rc.local:
    echo ondemand >/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
    echo 1152000 >/sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq
    echo 528000 >/sys/devices/system/cpu/cpu0/cpufreq/scaling_min_freq

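
A small wrapper can make the three writes easier to reuse and to check. A minimal sketch; `set_cpufreq` and its optional directory argument are hypothetical names introduced here, while the sysfs files are the ones used above:

```shell
#!/bin/sh
# set_cpufreq GOVERNOR MIN_KHZ MAX_KHZ [CPUFREQ_DIR]
# Writes governor, maximum and minimum frequency and stops at the first
# write that fails. CPUFREQ_DIR defaults to the cpu0 directory used above.
set_cpufreq() {
    dir=${4:-/sys/devices/system/cpu/cpu0/cpufreq}
    [ -d "$dir" ] || { echo "no cpufreq directory: $dir" >&2; return 1; }
    echo "$1" > "$dir/scaling_governor" || return 1
    # Same order as in the rc.local lines: maximum first, then minimum.
    echo "$3" > "$dir/scaling_max_freq" || return 1
    echo "$2" > "$dir/scaling_min_freq" || return 1
}
```

With this in place, the rc.local lines would shrink to `set_cpufreq ondemand 528000 1152000`.
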
The next step will be to try a higher voltage at 1.2 GHz and different RX/TX delay settings, since my aim is to use the Banana Pi in NAS scenarios. Since SATA write throughput is still limited to approx. 47 MB/s with the aforementioned overclocking settings, I will look for network settings where RX may drop a bit as long as TX performance improves and becomes more stable. I've set up an automated test environment using Armbian that tries out different u-boot settings (where part of the network initialisation happens): http://forum.armbian.com/index.p ... with-gigabit/?p=150

And this is where discussion should move: http://forum.armbian.com


tkaiser  
Edited by tkaiser at Thu Jun 4, 2015 04:30

Follow-up regarding 4.0.x cpufreq defaults:

They're set as follows:
    /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor = ondemand
    /sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq = 960000
    /sys/devices/system/cpu/cpu0/cpufreq/scaling_min_freq = 144000
    /sys/devices/system/cpu/cpufreq/ondemand/io_is_busy = 0
    /sys/devices/system/cpu/cpufreq/ondemand/up_threshold = 95
    /sys/devices/system/cpu/cpufreq/ondemand/sampling_down_factor = 1

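
A listing in this "path = value" form can be produced with a short loop, which is handy for recording the state before and after tuning. A sketch under the assumption that the files live where shown above; `show_tunables` and its optional root argument are hypothetical:

```shell
#!/bin/sh
# show_tunables [CPU_ROOT]: print "path = value" for the six cpufreq
# settings discussed here. CPU_ROOT defaults to /sys/devices/system/cpu.
show_tunables() {
    root=${1:-/sys/devices/system/cpu}
    for f in cpu0/cpufreq/scaling_governor \
             cpu0/cpufreq/scaling_max_freq \
             cpu0/cpufreq/scaling_min_freq \
             cpufreq/ondemand/io_is_busy \
             cpufreq/ondemand/up_threshold \
             cpufreq/ondemand/sampling_down_factor; do
        # Files that are missing or unreadable are skipped silently
        # (the ondemand files only exist while that governor is active).
        [ -r "$root/$f" ] && echo "$root/$f = $(cat "$root/$f")"
    done
    return 0
}
```
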
Both scaling_min_freq and the last 3 settings have a huge impact on NAS performance. For better results the minimum cpufreq should be increased (this makes little to no difference regarding power consumption or thermal issues) and the last 3 settings have to be adjusted as outlined in the linux-sunxi Wiki (io_is_busy is especially important!).

Simple conclusion: if the ondemand settings are adjusted accordingly, there's nearly no need to overclock for NAS usage.

And for better overall real-world performance one could use btrfs' transparent filesystem compression, which improves performance a bit further (mount options '-o noatime,compress=lzo').

Results:
    echo ondemand >/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
    echo 960000 >/sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq
    echo 528000 >/sys/devices/system/cpu/cpu0/cpufreq/scaling_min_freq
    (otherwise defaults)

    echo ondemand >/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
    echo 1152000 >/sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq
    echo 528000 >/sys/devices/system/cpu/cpu0/cpufreq/scaling_min_freq
    (otherwise defaults)

    echo ondemand >/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
    echo 960000 >/sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq
    echo 528000 >/sys/devices/system/cpu/cpu0/cpufreq/scaling_min_freq
    echo 1 > /sys/devices/system/cpu/cpufreq/ondemand/io_is_busy
    echo 25 > /sys/devices/system/cpu/cpufreq/ondemand/up_threshold
    echo 10 > /sys/devices/system/cpu/cpufreq/ondemand/sampling_down_factor

    echo ondemand >/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
    echo 1152000 >/sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq
    echo 528000 >/sys/devices/system/cpu/cpu0/cpufreq/scaling_min_freq
    echo 1 > /sys/devices/system/cpu/cpufreq/ondemand/io_is_busy
    echo 25 > /sys/devices/system/cpu/cpufreq/ondemand/up_threshold
    echo 10 > /sys/devices/system/cpu/cpufreq/ondemand/sampling_down_factor

    echo performance >/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
    echo 1152000 >/sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq


tkaiser  
One final note regarding the benefits of network settings and process scheduling on ultra slow boards like Banana Pi...

After applying some TCP/IP tuning and adjusting the CPU affinity of the 3 processes responsible for serving NAS requests from my MacBook Pro (details here), I ended up, with the performance governor, at 45 MB/s write and above 70 MB/s read through the network:



Even with conservative settings and ondemand governor without any overclocking the results are way better than when you just use kernel 4.x and Debian defaults:
    echo ondemand >/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
    echo 960000 >/sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq
    echo 528000 >/sys/devices/system/cpu/cpu0/cpufreq/scaling_min_freq
    echo 1 > /sys/devices/system/cpu/cpufreq/ondemand/io_is_busy
    echo 25 > /sys/devices/system/cpu/cpufreq/ondemand/up_threshold
    echo 10 > /sys/devices/system/cpu/cpufreq/ondemand/sampling_down_factor
    sysctl -w net/core/rmem_max=8738000
    sysctl -w net/core/wmem_max=6553600
    sysctl -w net/ipv4/tcp_rmem="8192 873800 8738000"
    sysctl -w net/ipv4/tcp_wmem="4096 655360 6553600"
    sysctl -w vm/min_free_kbytes=65536
    sysctl -w net.ipv4.tcp_window_scaling=1
    sysctl -w net.ipv4.tcp_timestamps=1
    sysctl -w net.ipv4.tcp_sack=1
    sysctl -w net.ipv4.tcp_no_metrics_save=1
    [do some taskset magic with the processes in question]

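
The "taskset magic" is not spelled out in the post. One possible shape, assuming util-linux `taskset` is installed, is to bind a list of PIDs to a single CPU core. `pin_pids` and the `DRY_RUN` switch are hypothetical, and which processes to pin depends on the NAS daemon in use:

```shell
#!/bin/sh
# pin_pids CPU PID...: bind each PID to the given CPU core with taskset.
# With DRY_RUN=1 the commands are only printed, not executed.
pin_pids() {
    cpu=$1
    shift
    for pid in "$@"; do
        if [ "${DRY_RUN:-0}" = 1 ]; then
            echo "taskset -cp $cpu $pid"
        else
            taskset -cp "$cpu" "$pid"
        fi
    done
}
```

A call could look like `pin_pids 0 $(pgrep -x some_daemon)`, where `some_daemon` is a placeholder for the actual process name. As with the other settings, whether this helps should be verified with real transfers.
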


But you should always check for yourself whether these settings improve performance or not. Depending on the clients used (no Windows here) the situation might be different. Real-world tests transferring huge chunks of data should be the first thing to try.


tkaiser  
Edited by tkaiser at Mon Jun 29, 2015 11:23

I made a small comparison with LeMaker's 'Raspbian For BananaPi v1412' image:

The main differences compared to a NAS-friendly setup with Mainline kernel:
  • LeMaker still uses somewhat weird cpufreq settings ('fantasy' governor and clock speeds between 720 and 912 MHz)
  • Kernel 3.4.103 instead of 4.x
  • Since the Raspbian image uses the Raspbian repositories, which were made for the real Raspberry Pis, the code of both applications and libraries is optimised for ARMv6 rather than ARMv7
  • Memory reservation for the GPU (MemTotal 895380 kB according to /proc/meminfo). This didn't influence the benchmarks but might make a difference in real-world scenarios, since the less RAM is available, the less can be used for file caches/buffers
  • Since btrfs in kernel 3.4 wasn't ready for prime time, I had to rely on ext4 for the tests


Results with LeMaker's default settings:

I used my usual setup (connecting an EVO 840 SSD to the Banana Pi's SATA port) and the usual iozone/iperf/LanTest approaches (see this thread or other pinned threads in this subforum):

iozone -a -g 2000m -s 2000m -i 0 -i 1 -r4k:

default fantasy 720/912 MHz: 42MB/s write, 142MB/s read
optimised settings (ondemand 1056 MHz): 43MB/s write, 175MB/s read

iperf between BPi and MacBook Pro:

default fantasy governor (720-912 MHz): 400 Mbits/sec TX, 650 Mbits/sec RX
optimised settings (ondemand 1056 MHz): 425 Mbits/sec TX, 780 Mbits/sec RX

With "Optimised settings" I refer to these (adjusting cpufreq settings, assigning eth0 interrupts to the second CPU core and some TCP/IP stack tuning for Gbit networking):

    echo ondemand > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
    echo 1056000 > /sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq
    echo 408000 > /sys/devices/system/cpu/cpu0/cpufreq/scaling_min_freq
    echo 25 > /sys/devices/system/cpu/cpufreq/ondemand/up_threshold
    echo 10 > /sys/devices/system/cpu/cpufreq/ondemand/sampling_down_factor
    echo 1 > /sys/devices/system/cpu/cpufreq/ondemand/io_is_busy
    echo 2 >/proc/irq/$(awk -F":" '/eth0/ {print $1}' </proc/interrupts)/smp_affinity
    sysctl -w net/core/rmem_max=8738000
    sysctl -w net/core/wmem_max=6553600
    sysctl -w net/ipv4/tcp_rmem="8192 873800 8738000"
    sysctl -w net/ipv4/tcp_wmem="4096 655360 6553600"
    sysctl -w vm/min_free_kbytes=65536
    sysctl -w net.ipv4.tcp_window_scaling=1
    sysctl -w net.ipv4.tcp_timestamps=1
    sysctl -w net.ipv4.tcp_sack=1
    sysctl -w net.ipv4.tcp_no_metrics_save=1
    sysctl -w net.core.netdev_max_backlog=5000
    ip link set eth0 txqueuelen 10000

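
The value written to smp_affinity is a hexadecimal CPU bitmask, so the `2` used for eth0 selects the second core (bit 1). A sketch of the two pieces involved; `cpu_mask` and `irq_of` are hypothetical helpers, and `irq_of` additionally strips the padding that /proc/interrupts puts in front of short IRQ numbers:

```shell
#!/bin/sh
# cpu_mask N: hexadecimal smp_affinity mask that selects only CPU N (0-based).
cpu_mask() {
    printf '%x\n' $((1 << $1))
}

# irq_of NAME [FILE]: IRQ number(s) of the lines in /proc/interrupts that
# mention NAME. FILE defaults to /proc/interrupts.
irq_of() {
    awk -F: -v name="$1" 'index($0, name) { gsub(/ /, "", $1); print $1 }' \
        "${2:-/proc/interrupts}"
}
```

With these, the affinity line could be written as `echo "$(cpu_mask 1)" > "/proc/irq/$(irq_of eth0)/smp_affinity"`, assuming exactly one interrupt line matches eth0.
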

Please remember: With mainline, optimised kernel settings, slight overclocking and software optimised for ARMv7 we were able to get 750 Mbits/sec TX and 940 Mbits/sec RX with iperf, and approx. 47MB/s SATA write throughput with iozone (using exactly the same Banana Pi with exactly the same SSD connected and nearly identical 'tunables' to those above).

In combined benchmarks that utilise both network and storage at the same time results differ even more:

"Raspbian for Banana Pi" defaults (28MB/sec write, 45MB/sec read):



"Raspbian for Banana Pi" with optimised settings and cpufreq of 960MHz (32MB/sec write, 45MB/sec read):



"Raspbian for Banana Pi" with optimised settings and cpufreq of 1056MHz (34MB/sec write, 51MB/sec read):




Using Mainline, ARMv7 optimised binaries and an Armbian image we were able to get over 15MB/s (write) and almost 30MB/s (read) more than with Raspbian's default settings. On identical hardware. This shows how important correct settings are on slow platforms such as small A20 based devices.

Redwid  
Hi tkaiser, would you mind sharing your Armbian mainline kernel with overclocking enabled?

tkaiser  
Edited by tkaiser at Mar 18, 2016 05:01
Redwid replied at Mar 03, 2016 07:36
Hi tkaiser, would you mind sharing your Armbian mainline kernel with overclocking enabled?

Of course I won't. The idea of 'overclocking' an A20 device with untested settings is already a bit insane, since these 'tablet grade' SoCs aren't tested at the factory. Besides, increasing the cpufreq isn't the most important thing, and choosing a SoC that is made for NAS usage is the best idea.
Please read carefully through http://linux-sunxi.org/Hardware_ ... Ffrequency_settings and http://linux-sunxi.org/Sunxi_dev ... _on_NAS_performance to get the idea what has to be considered when trying to increase clockspeeds and why replacing the kernel in an otherwise crappy OS image won't help.

Replacing one kernel with another is always wrong, especially since Armbian provides a fully functional build system. So set up a virtual machine with Ubuntu 16.04 LTS, follow the steps outlined here http://www.armbian.com/using-armbian-tools/ and put the following in the right place in the 'userpatches' dir. Then you can do any harm to your A20 device that you want (and you're still able to adjust the DVFS settings on your own, since the Armbian build system creates simple-to-install .debs with your settings):


    diff --git a/arch/arm/boot/dts/sun7i-a20.dtsi b/arch/arm/boot/dts/sun7i-a20.dtsi
    index e02eb72..7266c9a 100644
    --- a/arch/arm/boot/dts/sun7i-a20.dtsi
    +++ b/arch/arm/boot/dts/sun7i-a20.dtsi
    @@ -102,6 +102,11 @@
                             clock-latency = <244144>; /* 8 32k periods */
                             operating-points = <
                                     /* kHz    uV */
    +                                1200000 1500000
    +                                1152000 1500000
    +                                1104000 1450000
    +                                1056000 1450000
    +                                1008000 1450000
                                     960000  1400000
                                     912000  1400000
                                     864000  1300000

BTW: Please don't expect me to read answers in this useless forum. I stumbled across your post more by accident and will not get back to any answers anytime soon.

@tkaiser any patches for the 4.6.x kernel? I can't get it to work on the Banana Pi M1.

Redwid  
tkaiser replied at Mar 18, 2016 05:00
Of course I won't. The idea to 'overclock' an A20 device with untested settings is already a bit in ...

Thanks for patch file.
