NAS

BananaPi as a fileserver -- some personal thoughts and experiences.

tkaiser  
Edited by tkaiser at Mon Feb 9, 2015 06:35

The basics

The Banana Pi is a small single board computer. Both its name and board layout might suggest it's compatible with the well-known Raspberry Pi, but that's definitely not the case. At its heart is a different SoC (system on a chip) which features a different GPU than the RasPi and, more importantly, contains both SATA and GBit Ethernet. The RasPi lacks both, and its Ethernet chip is connected to an internal USB hub, so all USB ports and the network adapter share the bandwidth of the single USB port the RasPi's BCM2835 SoC provides.

Both the Ethernet and the SATA port of the Banana Pi's A20 SoC are attached directly and not via USB. They're able to negotiate GBit link speed and SATA II (Serial ATA 3.0 Gbit/s, SATA Revision 2.x), but neither CPU power nor hardware features are sufficient to reach the theoretical maximum of either interface. Typical SATA throughput without extensive tuning is in the range of approx. 40/200 MB/sec (write/read) and the GMAC's network speed between 470/550-700 Mbits/sec (write/read -- with a mainline kernel it seems to be possible to reach 700/940 Mbits/sec). If both interfaces are in use concurrently, as will happen most of the time in a file server setup, additional performance decreases will occur since chipset limitations and CPU power become a bottleneck.

The A20 SoC features 2 USB 2.0 (EHCI/OHCI) ports as well as a USB 2.0 OTG (USB On-the-Go) Micro USB connector. All three ports are connected directly to the A20 SoC and can achieve real-world read/write speeds of approx. 30 MB/sec (this is a hard limit due to USB 2.0's Mass Storage Bulk-Only Transport). If you share USB disks over the network, expect lower speeds since there's always some overhead.

CPU frequency stuff

The A20 SoC can be clocked in a wide range (officially between 60 and 1008 MHz -- lower clock speeds need less voltage). Clock speed has a direct impact on performance. There also exist different governors (policies) that adjust the clock speed dynamically depending on load. If you always need high performance you should choose 'performance'. The drawback is that the CPU will then always run at the upper limit defined in /sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq even when there's nothing to do at all. A better idea is to use the 'interactive' or 'ondemand' governor, which clocks dynamically between scaling_min_freq and scaling_max_freq depending on the generated load (you have to find out yourself which governor best balances performance and power consumption for your needs).

You will need the cpufrequtils package, and the governor has to be available in the kernel configuration (e.g. 'CONFIG_CPU_FREQ_GOV_ONDEMAND=y' -- compare with 'zcat /proc/config.gz'). Based on my tests slight overclocking is both possible and desirable (use a heatsink and ensure enough airflow):
  echo -n 1200000 >/sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq
  echo -n ondemand >/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
  echo 600000 > /sys/devices/system/cpu/cpu0/cpufreq/scaling_min_freq
  echo 25 > /sys/devices/system/cpu/cpufreq/ondemand/up_threshold
  echo 10 > /sys/devices/system/cpu/cpufreq/ondemand/sampling_down_factor
  echo 1 > /sys/devices/system/cpu/cpufreq/ondemand/io_is_busy
Statistics are available below /sys/devices/system/cpu/cpu0/cpufreq/stats/
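
To get a quick impression whether the governor behaves as intended you can read the current frequency and those statistics directly; a minimal sketch (paths as exposed by the 3.4.x sunxi kernels -- exact files may differ with other kernels):
  # current CPU frequency in kHz
  cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq
  # time spent at each available frequency (in units of 10 ms) since boot
  cat /sys/devices/system/cpu/cpu0/cpufreq/stats/time_in_state
  # total number of frequency transitions
  cat /sys/devices/system/cpu/cpu0/cpufreq/stats/total_trans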

Heat dissipation

It's important that the relevant parts of the Banana Pi stay cool since the components try to prevent overheating by dynamically lowering voltage and clock speed. So think about using heatsinks for both the CPU and the power management unit (AXP209). And consider mounting the board vertically to ensure enough airflow. Utilizing the 'chimney effect' might also be a good idea.

The BananaPi's AXP209 PMU has an integrated thermal sensor which can be read (in degrees Celsius) using
  awk '{printf ("%0.1f",$1/1000); }' </sys/devices/platform/sunxi-i2c.0/i2c-0/0-0034/temp1_input
The A20 SoC itself also contains an internal temperature sensor, but it's somewhat difficult to read and interpret the uncalibrated values:

http://www.cubieforums.com/index.php/topic=2493.0
http://www.cubieforums.com/index.php?topic=2293.0

Update: Community member FPeter provided a better approach: http://forum.lemaker.org/forum.php?mod=redirect&goto=findpost&ptid=8137&pid=47437

Update: A short overview including an archive with modified files to use RPi-Monitor on the BPi can be found here: http://forum.lemaker.org/forum.php?mod=redirect&goto=findpost&ptid=8312&pid=38582
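
Unrelated to these tools, a dead simple way to keep an eye on the PMU temperature over time is a small shell loop around the sysfs value shown above; a minimal sketch (sensor path as on the stock 3.4 kernel, log file name chosen arbitrarily):
  #!/bin/bash
  # append the AXP209 temperature (degree Celsius) to a log file once a minute
  SENSOR=/sys/devices/platform/sunxi-i2c.0/i2c-0/0-0034/temp1_input
  while true; do
    echo "$(date '+%F %T') $(awk '{printf ("%0.1f", $1/1000)}' <$SENSOR)" >>/var/log/pmu-temp.log
    sleep 60
  done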

SMP challenges

Containing a dual core CPU, the BananaPi's A20 has both more power and more potential problems compared to a single core implementation: how should tasks and interrupts be assigned to the individual CPU cores? Do a web search for 'SMP affinity' and 'IRQ balancing' for details. In short: when all interrupts are handled by one CPU core (usually CPU0), this core might become a bottleneck for network throughput. One approach is IRQ balancing: distributing interrupt processing evenly over the CPU cores in a round-robin fashion. Since this doesn't work too well in many situations (especially with network IRQs) there are alternatives like manually setting the SMP affinity of specific IRQs as well as of processes (the latter can be done using the 'taskset' utility).

The simplest solution on the BananaPi is to assign all network related interrupt processing to CPU1 by setting a specific SMP affinity for this IRQ (e.g. in /etc/rc.local):
  echo 2 >/proc/irq/$(awk -F":" '/eth0/ {print $1}' </proc/interrupts)/smp_affinity
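To verify that the change took effect you can check the affinity mask and watch how eth0 interrupts are counted; a small sketch using the same awk snippet as above:
  # bitmask of allowed CPUs for the eth0 IRQ (2 = CPU1 only)
  cat /proc/irq/$(awk -F":" '/eth0/ {print $1}' </proc/interrupts)/smp_affinity
  # per-core interrupt counters -- the eth0 counter should now only grow on CPU1
  grep eth0 /proc/interrupts
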
Further reading:

https://lkml.org/lkml/2012/8/4/51
http://comments.gmane.org/gmane.linux.ports.arm.kernel/102251
https://groups.google.com/forum/#!topic/linux.kernel/pNyi-qX9uz8
http://www.alexonlinux.com/why-i ... t-such-a-good-thing

TCP/IP settings

The default values of many TCP/IP tunables aren't optimal for GBit network speeds. Increasing buffer sizes and queue lengths helps in most GBit LAN scenarios (please be aware that the following settings might decrease performance on network links with high latency and low bandwidth):
  sysctl -w net/core/rmem_max=8738000
  sysctl -w net/core/wmem_max=6553600
  sysctl -w net/ipv4/tcp_rmem="8192 873800 8738000"
  sysctl -w net/ipv4/tcp_wmem="4096 655360 6553600"
  sysctl -w vm/min_free_kbytes=65536
  ip link set eth0 txqueuelen 10000
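To make the sysctl values survive a reboot they can go into a sysctl configuration file (the file name below is just a suggestion); the txqueuelen setting isn't a sysctl and has to be applied elsewhere, e.g. from /etc/rc.local:
  # /etc/sysctl.d/90-nas-tuning.conf -- read automatically at boot
  net.core.rmem_max = 8738000
  net.core.wmem_max = 6553600
  net.ipv4.tcp_rmem = 8192 873800 8738000
  net.ipv4.tcp_wmem = 4096 655360 6553600
  vm.min_free_kbytes = 65536

  # and in /etc/rc.local:
  # ip link set eth0 txqueuelen 10000
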
Scheduler settings and I/O priority

Setting both CONFIG_SCHED_MC=y and CONFIG_SCHED_SMT=y at kernel compile time seems to increase the possible throughput on a multi-core system like the BananaPi. In case your NAS will neither be used interactively nor concurrently by different users, you might get a performance boost 'at no additional cost' by adjusting the scheduler priority / ionice settings of the process serving the single network client. You have to get the process ID (PID) of that process (e.g. the smbd instance running under the UID of the client user in question) and could then do a
  ionice -c1 -p $PID
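A hedged example of how this could look for a single Samba client (the user name 'media' is just a placeholder; -c1 is the realtime I/O class, so use it with care):
  # find the smbd instance serving the user 'media' and raise its I/O priority
  PID=$(pgrep -u media smbd | head -n1)
  [ -n "$PID" ] && ionice -c1 -p $PID
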
Jumbo frames / MTU

While it seems possible to use an MTU of up to 3838 bytes, and this really helps in synthetic benchmarks like iperf, I didn't manage to get normal network loads stable afterwards and therefore returned to the 'traditional' MTU of 1500. It would be nice if others shared their experiences.
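
If you want to experiment yourself, the MTU can be changed and reverted at runtime; a minimal sketch (every device in the path -- switch, client NICs -- must support the larger frame size):
  # try the larger MTU for testing ...
  ip link set eth0 mtu 3838
  # ... and fall back to the default if normal workloads become unstable
  ip link set eth0 mtu 1500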

File systems

For dedicated NAS storage ext4 seems to be the best choice on SATA/USB. Other file systems lack features (for example xattr, ACL or TRIM support) or are problematic in one way or another. XFS on ARM might lead to data loss if the kernel isn't configured correctly at compile time. And while btrfs might seem to be an interesting choice, there are two problems associated with it: btrfs heavily depends on the kernel version in use (and at the time of this writing all Banana distros ship an outdated 3.4.x kernel), and since it's a checksum based file system designed with 'end to end data integrity' in mind it must not be used on devices that lack ECC RAM (intact data on disk might get corrupted while scrubbing due to bit flips in RAM).
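
A hedged example of putting ext4 on an existing SATA partition (device and mount point are assumptions -- double check before formatting):
  # create the file system (xattr/ACL support is built into ext4)
  mkfs.ext4 -L nasdata /dev/sda1
  # mount with ACLs and user xattrs enabled
  mkdir -p /srv/nas
  mount -o acl,user_xattr /dev/sda1 /srv/nas
  # on SSDs: issue TRIM manually from time to time (e.g. via a cron job)
  fstrim -v /srv/nas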

Partition alignment

While it's always a good idea to ensure proper partition alignment (taking the drive's sector sizes and the erase block size of SSDs into account), you most likely won't see any difference in performance when getting it wrong since the A20's SATA implementation or USB's BOT (Bulk-Only Transport) will be the bottleneck.
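
For completeness, a hedged sketch of creating an aligned partition anyway (a 1 MiB start offset covers the usual sector and erase block sizes; device name is an assumption):
  # new GPT label and one partition starting at 1 MiB, spanning the whole disk
  parted -s /dev/sda mklabel gpt
  parted -s /dev/sda mkpart primary ext4 1MiB 100%
  # verify the alignment of partition 1
  parted /dev/sda align-check optimal 1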

SATA Port Multipliers

Some people report that they work under certain circumstances. Please follow this thread: http://forum.lemaker.org/thread-9207-1-1.html (links might or might not work since the LeMaker guys rearrange the subforums every few days)

Benchmarking

When you do benchmarking, always work 'from bottom to top': measure storage performance and network throughput individually first, and only then measure combined throughput. To have a look at what's going on behind the scenes (CPU core utilisation and the like) use "htop" and e.g. "dstat -cdnpmgs --top-bio --top-cpu --top-mem".

Good benchmarking tools are e.g. iozone/bonnie++ to test local/remote storage and e.g. iperf/netperf to measure network speeds without storage interaction. These tools provide switches to adjust parameters like record/block/window sizes that might help to fine tune server settings. And they correctly bypass caches (one of the main mistakes people make when using e.g. 'dd' for tests: measuring not solely disk throughput but mainly buffers/caches instead).
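
Hedged example invocations that I would use as a starting point (parameters are illustrative, not tuned values):
  # network only: start 'iperf -s' on one machine, then from the other one:
  iperf -c <server-ip> -t 60 -r
  # storage only: sequential and random tests with different record sizes;
  # -e includes flush/fsync in the timing, -I uses direct I/O to bypass caches
  iozone -e -I -a -s 500M -r 4k -r 128k -r 1024k -i 0 -i 1 -i 2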

To measure the 'setup as a whole' from a client, while considering different performance relevant parameters (not just throughput), I personally prefer Helios' LanTest, available for free from http://webshare.helios.de (user tools, password tools).

Drive health / temperature

Using SATA for storage is not only faster than USB but also provides more ways to get health feedback from the drive (this might work with some USB enclosures/bridge chips as well, but with many it definitely won't).

Using the smartmontools package one can start offline self-tests of the drive or SSD and also read various SMART parameters from the drive (either manually with smartctl or using a special daemon called smartd -- compare with the manual pages). SMART parameters are drive (manufacturer) dependent, so you always have to ensure you have the most recent version of smartmontools' drivedb.
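
A few hedged smartctl examples (device name is an assumption):
  # identity, health status and all vendor specific SMART attributes
  smartctl -a /dev/sda
  # start a short offline self-test; check the result later with 'smartctl -l selftest /dev/sda'
  smartctl -t short /dev/sda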

You should monitor the specific health indicators that apply to your drive/SSD (reallocated sectors or wear leveling count, for example) and you should always have a look at attribute #199 when you add or change a drive. If its value is above 0 or increases when you write to the drive, then something's wrong with the cabling/connection. Have a look at the well-known SMART attributes and their meaning here:

http://en.wikipedia.org/wiki/S.M ... M.A.R.T._attributes

Unfortunately the Debian Wheezy smartmontools package is outdated as hell, so one has to patch update-smart-drivedb prior to first usage -- do a web search for 'update-smart-drivedb wheezy' to get an idea what needs to be changed.

The same applies to the nice hddtemp package: most modern drives will be missing from its drive database. But it's easy to add your own drive. Run update-smart-drivedb (and fix it if it complains, as outlined above), then use 'smartctl -a /dev/sda' to read all available SMART parameters/values (in case of my SSD the interesting parameter is #190, currently reading 24°C):
  Device Model:     Samsung SSD 840 EVO 120GB
  190 Airflow_Temperature_Cel 0x0032   ...   24
Then check the exact name pattern hddtemp needs by doing a
  hddtemp /dev/sda --debug
(in my case this outputs "Samsung SSD 840 EVO 120G B" with a space between 120G and B). Then simply add a new line to /etc/hddtemp.db in the following form:
  1. "Samsung SSD 840 EVO 120G B" 190 C "Samsung SSD 840 EVO 120GB"
Afterwards 'hddtemp /dev/sda' should simply work.

Final thoughts

For a device that cheap the network throughput as a NAS is fairly good, provided the configuration is done right and the components have been chosen wisely (SATA instead of USB, network infrastructure and so on).

But since it lacks ECC RAM, bit rot will happen over time. This problem can only be addressed by using checksum based filesystems that provide 'end to end data integrity' on server grade hardware featuring at least simple ECC memory.

For example, we observe DRAM error rates that are orders of magnitude higher than previously reported, with FIT rates (failures in time per billion device hours) of 25,000 to 70,000 per Mbit and more than 8% of DIMMs affected per year.

http://www.cs.toronto.edu/~bianca/papers/sigmetrics09.pdf


Wahhh, thanks a lot for documenting this here. I think I read all your posts in the Bananian forum, but this is great knowledge, perfectly summarized.

Thanks again

tkaiser  
Edited by tkaiser at 2014-10-15 02:32

The idea behind this 'article' was to collect some feedback from other users which can then be compiled into a wiki page. BTW: I forgot one section:

Great job!

tkaiser  
Edited by tkaiser at 2014-10-22 00:50

A small footnote regarding CPU governors:

I compared ondemand with interactive using a mixed network load.

While ondemand provided more snappiness (things like directory enumeration, open/close calls), interactive performed slightly better regarding throughput. Scaling was done between 600-1200 MHz (with ondemand most of the time at 1200 MHz, while interactive hit the peak only sometimes and stayed most of the time in between or at the lower level).

I used the thermal sensors of the SoC, the AXP209 PMU and the SSD and graphed them using RPi-Monitor. Temperature is a direct indicator of power consumption and here interactive clearly wins (ondemand test between 12:46:30-12:51, interactive between 12:53-12:57:30):



The ondemand results:



The interactive results:

tkaiser  
Edited by tkaiser at 2014-10-22 00:51

Another follow-up regarding monitoring (I simply use RPi-Monitor http://rpi-experiences.blogspot.fr/p/rpi-monitor.html because it's both fully customizable and lightweight).

After altering cpu.conf I can graph CPU clock speed and internal power consumption of the AXP209 PMU (PSU power dissipation as well as SATA power not included). I did some iperf testing between the BananaPi and a Mac Mini with interactive governor:

Throughput:


Temperatures:


CPU frequency (right axis) and PMU internal power consumption (left axis):


My /etc/rpimonitor/template/cpu.conf now reads: http://pastebin.com/hCBW0Hrz which results in this status overview of RPi-Monitor on Banana Pi:

(for the temperature stuff have a look above or search the forum)

T.S.  
In case I have missed anything: where have these very interesting postings from tkaiser been moved to?

T.S. replied at 2014-10-16 14:41
In case I have missed anything but where are this very interesting postings from tkaiser are moved t ...

Not that I'm aware of...
The posts seem to be modified by him, so I've been asking him to restore them if possible...

Post fixed! Thx!

From your valued analysis I would like to point to the following details:

TCP/IP and scheduler settings currently are not pivotal because disk writing speed is the bottleneck, saturating at 40 MB/s.

Jumbo frames might be interesting in the future. However, going from 1500 to double that would not make too big a difference. The Ethernet chip allows huge frame sizes, and imho it looks like a bug in the Ethernet driver that bigger MTUs do not work reliably.

Health status: Checking the supply voltages would be interesting.

Lack of ECC RAM: Assuming 8% of DIMMs are affected per year, with 2 RAM chips one can expect an average of more than 5 years until a failure happens (roughly 1 / (2 × 0.08) ≈ 6 years). That rate is most probably lower than that of other glitches on such a board. If the reliability of RAM is a concern, intensive RAM testing upon purchase is advisable.
