SAMBA CPU usage high

Edited by tkaiser at Fri Apr 3, 2015 09:24
Suman replied at Thu Apr 2, 2015 08:29
The scaling_max_freq setting for some other Linux computers I have access to ... So what does this setting actually mean? What does it do?

It's the upper limit of the cpufreq mechanism. But the effective clock also depends on other things, e.g. the CPU governor (with 'fantasy' random frequencies would be used) or, on x86 systems, the (automagically filled in) limits in /sys/devices/system/cpu/cpuX/cpufreq/bios_limit.
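These cpufreq knobs can be inspected directly via sysfs; a quick sketch (the paths are the standard Linux cpufreq interface, so they apply to the A20 boards discussed here as well as x86):

```shell
# Inspect the current cpufreq limits and governor for the first core.
# These sysfs paths are the standard Linux cpufreq interface.
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq    # upper limit in kHz
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_min_freq    # lower limit in kHz
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor    # active governor
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_governors
```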

Regarding the actual values written to scaling_max_freq: when you're on kernel 3.4 they're derived from the fex file (board initialisation), and which values are written therein is described here: (it's a dynamic process how the maximum values evolved over time, and on an SBC you might be able to increase clocks where you aren't able to when the same SoC is used inside a tablet, due to different thermal challenges).

BTW: In this dynamic process errors were also made. The fex file for the Cubieboard 2 and Cubietruck (both based on the A20 like the Bananas) had slightly too low voltage entries in the dvfs_table of the first Linux images for these boards. And what worked flawlessly in the original submitter's setup caused trouble for numerous users of these boards, so all voltage entries were incremented a bit after a while. The problem is: after the production of these cheap SoCs there is no selection process where they're tested for their limits and sold afterwards as different CPU types, like it's the normal process with x86 (e.g. these two Intel CPUs -- the i7-5557U and i5-5257U -- were produced on the same wafer and only afterwards differentiated into two different 'CPU lines'). This test would be way too expensive, so you have to do it yourself to find the real limits of the SoC in question (or stay safe with community-derived defaults).
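For illustration, a dvfs_table section in a sunxi fex file pairs operating frequencies with core voltages roughly like this (the values below are made up for illustration, not recommendations; the real entries are board-specific):

```ini
[dvfs_table]
; hypothetical A20 operating points: frequency in Hz, voltage in mV
; setting an LVx_volt too low for its LVx_freq is exactly the kind of
; error described above
max_freq = 960000000
min_freq = 60000000
LV_count = 2
LV1_freq = 960000000
LV1_volt = 1400
LV2_freq = 720000000
LV2_volt = 1200
```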

Edited by tkaiser at Fri Apr 3, 2015 09:22
Suman replied at Thu Apr 2, 2015 13:02
Effectively 50 MB/s max. This should be the theoretical limit of how much data you can move to and fro from this board using GbE. And I am already able to move 37 MB/s using SAMBA in the read flow.

You can get even more. See the results with mainline kernel and some tuning (but 'only' 1056 MHz instead of 1200 MHz since the cpufreq stuff in 3.19 wasn't ready at that time).

But there are two different bottlenecks present: SATA performance (read faster than write) and GMAC/network performance (read slower than write). So if you're really after performance the best 'solution' (let's better call it workaround) on a cheap A20 based board would be to set up a RAID-0 (mdraid with kernel 3.4, and btrfs with a mixture of RAID-1 and RAID-0 with mainline kernel!) consisting of one SATA disk (45/200 MB/s write/read) and one USB disk (~30/~30 MB/s), since you get +60/+60 MB/s local storage performance afterwards (this might increase client-to-server throughput but also decrease potential throughput in the other direction -- I was able to measure above 70 MB/s when reading from the Banana Pi when using mainline kernel and optimized settings).

I was able to do some more tests for this problem. By selective replacement of client and servers, I determined that my Sony Vaio laptop, which was used as the client PC, was having a throughput problem on its GbE interface; replacing it with my desktop sorted this issue out.

To summarize, 25 MB/s write and 40 MB/s read is where I am maxing out now. Details below:

Comparative Test with FreeNAS Server:
iPerf gave a performance of 117 MB/s against a theoretical max of 125 MB/s. This is an Intel Core i3-2100T (2.5 GHz dual core), 8 GB DDR3-1600, Asus H77-I motherboard, 1 GbE PC build.
I guess this high output (936 Mbps) should go into the network troubleshooting chapter.

BPro Independent GbE Ethernet Test:
iPerf throughput = 100 MB/s average, with a 94-106 MB/s variation band.

Also tried setting jumbo frames (MTU=9000) on the Banana Pro, router and Windows PC, but could not proceed further as the BPro networking daemon refused to start with the 9000 MTU setting. I also otherwise noticed some drop in performance with FreeNAS as server, and therefore I aborted this jumbo frame idea altogether.

The throughput also varies if, for example, another WiFi client is watching streamed video on the network using the same router. The above test was done when other clients on the network were hardly doing anything (relatively unloaded network).

BPro Independent SATA Disk Test
SATA disk write performance was 40.5 MB/s (synchronous) and 43.5 MB/s (asynchronous). Could not pick up a stable CPU usage reading due to wild fluctuations.
SATA disk read performance was 57 MB/s with 25-30% loading on core 1.

The SATA disk (Hitachi Travelstar 5K320-160) performance test was done with the 'dd' tool as below:

dd if=/dev/zero of=filespec bs=1M count=1024  // for asynchronous writes
dd if=/dev/zero of=filespec bs=1M count=1024 conv=fdatasync // for synchronous writes
dd if=filespec of=/dev/null bs=1M count=1024  // reads are always synchronous by default
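A variant worth noting: dd can also bypass the page cache entirely via O_DIRECT, which removes caching effects from the measurement ('filespec' is again a placeholder path):

```shell
# Bypass the page cache entirely with O_DIRECT ('filespec' is a placeholder
# path; O_DIRECT needs a filesystem that supports it)
dd if=/dev/zero of=filespec bs=1M count=1024 oflag=direct   # uncached writes
dd if=filespec of=/dev/null bs=1M count=1024 iflag=direct   # uncached reads
```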

This disk is SATA2, manufactured in 2008, and spins at 5400 rpm. I have a relatively new (3-year-old) WD enterprise disk (3.5") which returns 95.2 MB/s asynchronous writes and 9.3 GB/s reads on my client PC booted into Linux.

BPro Samba File Transfers Tests:
The above two tests should establish that neither the SATA disk (despite being rather dated) nor the GbE network is the bottleneck in the Samba performance test. That leaves mostly the smbd process and anything else happening in the Bananian kernel while it runs. I used the "htop" utility this time, and configured HELIOS LanTest to execute 1000 runs of only the 300 MB read and write tests.

Here are the latest test results of samba:

LanTest read with default smb.conf = 34.2 MB/s. CPU usage (core1 = 93%, core2 = 6%)
LanTest write with default smb.conf = 24.7 MB/s. CPU usage (core1 = 27-77%, core2 = 4-15%)
LanTest read with tuned smb.conf = 40.3 MB/s. CPU usage (core1 = 93-95%, core2 = 5%)
LanTest write with tuned smb.conf = 24.9 MB/s. CPU usage (core1 = 25-78%, core2 = 5-16%)

I didn't yet do any modifications to frequency scaling, governors or interrupt balancing. Maybe that's next.

The tuned smb.conf had the following *extra* tuning options over the default file:

use sendfile = yes
aio read size = 16384
aio write size = 16384
read raw = yes
write raw = yes

A little digging in the Samba design docs also tells that smbd is not multithreaded. The Samba server launches ONE process per client, i.e. only one thread. Which means the CPU clock is a rate-limiting factor on the performance achievable from one client, as long as it is within the GbE throughput limits. In my case only one smbd process is active and it is eating almost 100% of one A20 core.
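This one-process-per-client model is easy to verify while a transfer is running; a sketch (assumes Samba is installed and a client is connected):

```shell
# Each connected client gets exactly one smbd worker process
smbstatus -p                    # list smbd PIDs, one per client connection
ps -o pid,psr,%cpu -C smbd      # PSR shows which core each smbd runs on
```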

There also seems to be a lot of dependency on the network and the client (untuned SMB clients are about 25% slower than Windows clients), which can make results vary drastically.

Also, FreeNAS CIFS shares on the Intel PC (it also internally runs Samba) are able to reach 64% of the practical GbE throughput, but we only get to about 40% on the Banana Pro with Linux.

Thanks for sharing.

Regarding disk performance: I have no clue why anyone uses 'dd' for I/O benchmarks (since without "iflag/oflag direct" you partially test caches/buffers, and bs=1M is not related to any 'real world' workload). I would always use iozone (and/or bonnie++) since it's interesting how storage performance varies when testing different record/block sizes.
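A sketch of such an iozone run (the file path is a placeholder): sequential write and read tests across several record sizes, with a test file larger than the board's RAM and O_DIRECT plus flush-timing enabled so caches can't skew the numbers:

```shell
# -e: include flush (fsync) in timing   -I: use O_DIRECT
# -i 0 -i 1: sequential write + read    -r: record size   -s: file size
# /sata/testfile is a placeholder path on the disk under test
iozone -e -I -i 0 -i 1 -r 4k -r 128k -r 1m -s 2g -f /sata/testfile
```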

Regarding your Samba throughput maximum: I would suggest reading an older thread here and especially this conclusion. Since silentcreek uses a USB disk, simply adjusting IRQ affinity should further improve throughput in your setup (adjusting network settings for Gbit networks and the cpufreq stuff of course as well).
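Adjusting IRQ affinity on a dual-core A20 boils down to writing a CPU bitmask into procfs; a hedged sketch (the IRQ number 117 and the interface name are assumptions; check /proc/interrupts on your own board):

```shell
# Find the IRQ used by the GMAC/Ethernet (name varies by kernel/driver)
grep -i eth0 /proc/interrupts
# Pin that IRQ (117 here is an assumed example) to CPU1 so smbd can have
# CPU0 to itself; the value is a hex CPU bitmask: 1 = CPU0, 2 = CPU1
echo 2 > /proc/irq/117/smp_affinity
```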

Possibly ignorance (as in my case), i.e. not being aware of better tool options. Most searches on testing disk performance on Linux lead to pages describing 'dd' and 'hdparm' usage. Only lately has Bonnie been hitting Google searches. Besides, we got used to the fact that disk r/w performance is very rarely an issue, unless your drive starts failing or something.

Anyway, I will try the two tools you suggested and check what they throw up. Here's what the dd test profile looks like with varying block sizes for write.

My 2.5" HDD Performance profile


Edited by tkaiser at Tue Apr 7, 2015 01:52
Suman replied at Tue Apr 7, 2015 00:25
we got used to the fact that disk r/w performance is very rarely an issue, unless your drive starts failing or something.

Unfortunately not the case with A20/A10 since there seems to be a SATA write limitation (~45 MB/s maximum currently -- maybe this is a driver issue and will be fixed sometimes in the future).

And the problem with dd/hdparm (or to be more precise: 'test results' based on them published somewhere) isn't the tools themselves but the variety of parameters you can call them with. And while 'hdparm -T' is a useful tool to measure I/O-relevant stuff in the OS's caches/buffers (anything but the disk), its results always show up in benchmark comparisons. Same with dd and inappropriate parameters.

As an example for misleading use of both commands: Chapter 4 in this 'benchmark comparison' that got referenced a lot:

It's obvious that the 'cached reads' don't say anything about real I/O performance at all. And that also applies to the 'write test', which has obviously been done with a test size that doesn't fit into the Raspberry A's 256 MB RAM but partially or completely into the 512/1024 MB of the Raspberry B and Banana Pi (since all these boards perform nearly identically when it's about sequential writes/reads to SD card: the BPis and all A20 based boards max out at 16.x MB/s, the RPi is a bit faster with 17.x MB/s). So instead of graphs without any real meaning, the simple sentence "we could consider normal SD operations are the same, on Banana or Raspberry" would've been enough. The wrong benchmark methods just produce numbers without meaning (not the case in your tests, it's just my usual rant against dd/hdparm).

Did a sample run with Bonnie++ for SATA performance locally on the B-Pro. Its benchmark results are not very different from what 'dd' reported.

Bonnie++ reports (invoked as: bonnie++ -d filespec -r 2048 -u sluthra):
    Write = 42.3 MB/s    // with 'dd' we got 40.5 MB/s (synchronous) and 43.5 MB/s (asynchronous)
    Read = 61.2 MB/s     // with 'dd' we got 57 MB/s

Just 4-7% variation. A 5-10% variation in any performance measure is usually present when you are trying to reproduce benchmarks on any general-purpose OS.

HTS320-160 performance profile - Bonnie  .jpg

What do you say? Perhaps 'dd' is giving a fair approximation of what performance to expect from the SATA disk.

Edited by tkaiser at Tue Apr 7, 2015 03:46
Suman replied at Tue Apr 7, 2015 02:15
Perhaps 'dd' is giving a fair approximation of what performance to expect from the SATA disk.

Might be the case in your special situation (where you tested with a file size of 1 GB with dd and also used the appropriate flags to measure real disk I/O performance and not just partially caches/buffers).

When people normally use dd they omit any further flags (you call this 'asynchronous writes', I would call it 'buffers+disk'), and this leads to results that depend on the kernel (and on the filesystem implementation when not testing raw devices, since buffer/cache strategies might change). Results depend on the caching strategy, which in turn depends on the available RAM (which might differ between test runs depending on other tasks allocating more or less RAM), on the total RAM size and on the test file size:
  root@bananapi:/sata# dd if=/dev/zero of=/sata/evo840/2048M bs=1M count=2048
  2048+0 records in
  2048+0 records out
  2147483648 bytes (2,1 GB) copied, 49,9121 s, 43,0 MB/s

  root@bananapi:/sata# dd if=/dev/zero of=/sata/evo840/1024M bs=1M count=1024
  1024+0 records in
  1024+0 records out
  1073741824 bytes (1,1 GB) copied, 22,7586 s, 47,2 MB/s

  root@bananapi:/sata# dd if=/dev/zero of=/sata/evo840/512M bs=1M count=512
  512+0 records in
  512+0 records out
  536870912 bytes (537 MB) copied, 10,3897 s, 51,7 MB/s

  root@bananapi:/sata# dd if=/dev/zero of=/sata/evo840/256M bs=1M count=256
  256+0 records in
  256+0 records out
  268435456 bytes (268 MB) copied, 3,91015 s, 68,7 MB/s
Misleading/unreliable stuff like the latter test (256 MB test size vs. 1 GB RAM) will then be published somewhere on the net as the 'raw SATA write speed of this device'. And that's the problem with dd: you can't trust the results you find unless you know exactly how the tester called it. Tools like bonnie++ keep that in mind, use an appropriate test file size and show a warning if you try to test the wrong things.

And with dd you will also easily get into rewrite situations where you wanted to test write performance (normally that doesn't matter much, but it's important if you use modern filesystems like btrfs/ZFS and block sizes that are smaller than the block size of the fs in question -- please compare with the results at the end of this post). Dedicated benchmark tools prevent you from making this mistake since they remove their test files after successful completion and do separate write/rewrite tests on their own.

In other words: it's very easy to call dd/hdparm wrong, so you can't rely on the results published by someone else. Real benchmark tools like iozone/bonnie++ take care of that -- they even report CPU usage while running the test (very important on slow platforms like SBCs -- I find it useful to run 'iostat 5' in a different terminal in parallel to get a real clue what's going on). And in my opinion it's more convenient to let bonnie++ create a CSV file containing the necessary count of test runs where you can look up strange issues later (since not only performance but also boundary conditions are written), and you get way more information than just sequential transfer speeds.
  bonnie++ -d /sata/evo840 -m"EVO 840 btrfs" -x 10 -q 2>/tmp/bonnie-stderr >/tmp/bonnie-results.csv

LanTest report with varying file sizes:

Lantest Varyng file size.jpg
If we write file sizes of 3 GB and 12 GB (my earlier test file size was 300 MB), then the write throughput increases to almost 35 MB/s. Earlier we were stuck around a 25-28 MB/s average in the write flow. The CPU usage is also higher (almost 99-100%) with bigger files.
Also, for a 3 GB file write, HELIOS LanTest reports a similar transfer rate to the Windows SMB client (through a regular file copy operation). There was a difference earlier between the Windows copy and LanTest. So the Windows copy is also not a grossly incorrect measurement of the throughput.

Suman replied at Tue Apr 7, 2015 09:23
LanTest report with varying file sizes:

It's more about record sizes than file sizes -- please see below.

Regarding the relationship between the 'synthetic' LanTest benchmark and Windows Explorer (and also network performance when accessing servers from within applications!) please have a look at ... There it's also explained what changes when you adjust test file sizes (the record size, which is responsible for different results as long as the test file size is way larger than the buffers/caches of the smbd process in question).

And please keep in mind that I've been able to reach 37/73 MB/s: (tests done with only 1056 MHz cpufreq... from OS X using AFP, but I doubt that similar results aren't achievable from Windows using SMB/Samba)
