Reliable temperature monitoring

tkaiser  
Edited by tkaiser at Sun Nov 9, 2014 15:40

I changed my measurement setup a bit. Since I wanted to see how an A20 behaves with and without a heatsink, I had to use my A20-Lime2. So I changed my temp-daemon to measure internal temperatures directly on the Lime2 and to transmit them over the network to the Banana Pi, which is connected to the two DHT22 sensors (one for ambient temperature, one for the surrounding temperature inside an enclosure).

I started the usual 'stress test' (stress -t 900 -c 2 -m 2 -i 2 -d 2) with only bit 4 set in the TP_CTRL1 register on the A20-Lime2 without a heatsink:
    echo 'f1c25004:10' > /sys/devices/virtual/misc/sunxi-dbgreg/rw/write


Then at 18:22 I also set the CHOP_TEMP_EN bit to 1 ("echo f1c25004:90") and the temperature readings immediately dropped by a few degrees (in fact by a few degrees too many, since I still had a manual correction in place from before, which I only noticed when I changed the setting).

I investigated a bit further and made different test runs (using the stress test and increasing scaling_max_freq in 96 MHz steps) on the A20-Lime2, with bit 7 set or cleared and with or without a heatsink on the A20. The differences between CHOP_TEMP_EN enabled and disabled weren't as large as on my second Banana Pi a few days ago (25°C lower with a heatsink on the A20). The good news: with bit 7 set to 1 the base temperature values seem to be more precise. Enabling or disabling CHOP_TEMP_EN simply shifts the 'base temperature' down or up by a constant offset. I compared a few runs at different clock speeds, and once ambient temperature is taken into account the difference was always just 4°C:
    Clock    | CHOP_TEMP_EN off | CHOP_TEMP_EN on | Difference
             | (SoC - ambient)  | (SoC - ambient) |
    idle     | 42-26 = 16°      | 37-25 = 12°     | 4°
    816 MHz  | 49-26 = 23°      | 44-25 = 19°     | 4°
    912 MHz  | 51-26 = 25°      | 46-25 = 21°     | 4°
    1008 MHz | 52-26 = 26°      | 47-25 = 22°     | 4°
    1104 MHz | 53-26 = 27°      | 48-25 = 23°     | 4°

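The arithmetic behind these rows is just "reading minus ambient", compared between the two runs. A minimal sketch of that calculation, using the numbers from the idle and 1104 MHz rows; the function names `delta` and `chop_offset` are made up for illustration and are not part of any tool mentioned in this thread:

```shell
#!/bin/sh
# Hypothetical helper: temperature rise above ambient, in whole degrees Celsius.
# Usage: delta SOC_TEMP AMBIENT_TEMP
delta() {
    echo $(( $1 - $2 ))
}

# Offset between two runs after each one is corrected for its own ambient.
# Usage: chop_offset SOC_A AMBIENT_A SOC_B AMBIENT_B
chop_offset() {
    echo $(( $(delta "$1" "$2") - $(delta "$3" "$4") ))
}

chop_offset 42 26 37 25   # idle row
chop_offset 53 26 48 25   # 1104 MHz row
```

Both calls print 4, which is the constant offset described above.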
Unfortunately the values the A20 reports with CHOP_TEMP_EN enabled ("echo f1c25004:90") seem to be a few degrees too low. At 18:23/18:25 and at 18:54/18:58 I pressed my thumb/pinkie on the A20 and the AXP209 and the temperatures dropped immediately, as expected. But they dropped too far: they fell below my body temperature, which is impossible. So expect the temperatures to be a few degrees off in either direction when reading the internal thermal sensors. Still, enabling CHOP_TEMP_EN, or rather restoring it to its default 'enabled' state, is the way to go because the base temperature the A20 then reports is closer to reality.

Using sunxi-dbgreg to read out the thermal sensor inside the A20's TP controller therefore means setting both bit 4 and bit 7, i.e. writing 0x90 instead of 0x10 to the TP_CTRL1 register:

    echo 'f1c25004:90' > /sys/devices/virtual/misc/sunxi-dbgreg/rw/write
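The two values are simply the bit positions OR'ed together: bit 4 alone gives 0x10, bit 4 plus bit 7 gives 0x90. A small sketch of that calculation; `bits_to_hex` is a made-up helper name, and the script only prints the resulting write command instead of touching the register:

```shell
#!/bin/sh
# Hypothetical helper: OR together the given bit positions and print the
# result as a hex number without the 0x prefix.
bits_to_hex() {
    value=0
    for bit in "$@"; do
        value=$(( value | (1 << bit) ))
    done
    printf '%x\n' "$value"
}

bits_to_hex 4      # bit 4 only
bits_to_hex 4 7    # bit 4 plus bit 7 (CHOP_TEMP_EN)

# Print (do not execute) the write command used in this post.
echo "echo 'f1c25004:$(bits_to_hex 4 7)' > /sys/devices/virtual/misc/sunxi-dbgreg/rw/write"
```

The first two calls print 10 and 90, matching the two register values used in this post.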

tkaiser  
Edited by tkaiser at Tue Nov 11, 2014 02:22

Regarding effective heat dissipation:

This is the A20-Lime2 without a heatsink. I ran the same set of two stress tests (each lasting 15 minutes, the first at 912 MHz CPU frequency, the second at 1008 MHz) three times under different conditions:



  • 10:25 - 11:00: The A20-Lime2 lies flat on a table without an enclosure (on the Lime2 the A20, AXP209 and DRAM are on the top side of the PCB; on the Banana Pi everything is on the bottom!)
  • 11:00 - 11:45: I put the Lime2 into a small box and placed one DHT22 right above the A20
  • 11:50 - 12:35: I took the Lime2 and the DHT22 out of the box and changed the board's position from horizontal to vertical

Position does matter since operating the device upright helps with convection. And appropriate air flow matters especially when you want to operate your small device under high load for hours (I ran tests/simulations, e.g. at the start of this thread, and even managed to force an emergency shutdown of my Banana Pi due to overheating by simulating an 'enclosure from hell' that allows no air flow at all).

Some final words regarding the effectiveness of different heatsinks. I always used the same routine: run a 15-minute stress test, idle for 15 minutes, increase the CPU clock speed by 96 MHz and continue (the A20-Lime2 crashed at 1200 MHz, the Banana Pi did not; still no idea why):

    # Step through the CPU clock limits (in kHz): 15 min of load, then 15 min idle
    for i in 816000 912000 1008000 1104000 1200000 ; do
            echo "$i" > /sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq
            cd /data && stress -t 900 -c 2 -m 2 -i 2 -d 2
            sleep 900
    done

This is the A20-Lime2, now with the same ineffective heatsink I also applied to my second Banana Pi (temperatures above ambient at 1.1 GHz for CPU/PMU: 23°/24°C, and 13°/11°C when idle; power consumption chart here)



This is my first Banana Pi with SMD heatsinks applied to CPU, PMU and DRAM. In this setup I put one DHT22 sensor (the purple graph) very close to the PMU (image here) and directly above the A20's heatsink. In fact the presence of the DHT22 interferes with air flow, so the internal temperatures reported by CPU/PMU are higher than necessary. But this test run with the DHT22 so close to both chips leads me to believe the internal temperature values are reliable:



And this is the Banana Pi measured again, but in the correct position with the maximum possible convection effect; images here, here and here (temperatures above ambient at 1.1 GHz for CPU/PMU: 18°/17.5°C, and 10°/7°C when idle; power consumption chart here)

tkaiser  
Edited by tkaiser at Tue Nov 11, 2014 13:33

Conclusion:

After a few days of investigating an enclosure for A20 boards together with a 3.5" HDD I came to the following conclusions:

  • Don't trust the values you read (BTW: most users of an A20 board still believe the PMU's temperature value is the 'CPU temp'). Both the PMU's and the SoC's temperature values might be off by a few degrees. But this doesn't matter
  • The temperatures that can be read out from devices (SoC, PMU, disk) depend on and scale almost linearly with ambient/surrounding temperature (keep this in mind when comparing test setups and when choosing or building an enclosure)
  • Random sampling is pretty useless. To get the real picture you have to monitor thermal values continuously at frequent intervals, and you need the ambient temperature as well (at least once in the beginning, when you try to measure the effects of an enclosure and of load in different scenarios)
  • If you want to measure both internal and external sensors on a system that is under high load you might have to take special precautions, since sensors might not respond immediately or might work unreliably when the A20 is clocked very high or very low (workarounds can be found in this thread, e.g. using an I2C/W1 bridge for external sensors or setting up a daemon that 'caches' previously read values when the current value cannot be queried due to timeouts; see the tutorial and the files necessary to set this up using RPi-Monitor)
  • An enclosure with 'bad thermal design' might negatively affect lifespan and performance (the latter due to possible throttling)
  • Convection works, adhesive heatsinks too (when combined with convection), heatsinks with thermal paste work best (when combined with convection). But under normal conditions with only temporarily high loads it should be sufficient to ensure appropriate airflow and to put your A20 board in an upright position
  • The temperature from the SoC originates from a thermal probe inside the A20's touchpanel controller. Since nobody outside of Allwinner knows the location of this controller inside the A20 (a 'huge' 20x20mm BGA) and nobody knows how accurately it operates, it is wrong both to call this value 'CPU temperature' and to rely on the reported values at all (there might be hotspots on the CPU die's surface that exceed the TP controller's values by a wide margin)
  • The only important temperature value is the one from the PMU (in the Banana's case an AXP209 -- a 6x6mm BGA). Its temperature scales somewhat linearly with CPU load, but only when there's nothing else for the PMU to do (e.g. charging a LiPo battery or powering a lot of connected USB peripherals). So if you keep an eye on power consumption and the PMU's temperature you can roughly estimate the CPU's temperature. The PMU's temperature seems to be the most important 'health indicator'
  • In normal network-related workload scenarios you won't be able to create nearly as much load as with the aforementioned stress tests. These are worst-case scenarios. In practice you should keep an eye on the power consumption of connected peripherals (e.g. bus-powered USB disks) and on the PMU's temperature. And keep in mind: a connected SATA disk that is fed via the SATA power connector won't add to the PMU's overall power consumption, but of course to the PSU's.
  • If you want to read out the SoC's thermal sensor you have to ensure that the CHOP_TEMP_EN bit is still enabled. Using the sunxi-dbgreg.ko method you have to write 0x90 instead of 0x10 to the TP controller's TP_CTRL1 register
  • If you just want to read out the sensors without setting up a monitoring system you could use my simple script referenced here or define shell functions
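
The 'caching' workaround from the list above (keep the last good value when a sensor cannot be read) can be sketched in a few lines of shell. This is only an illustration of the idea, not the daemon from the tutorial; `read_cached` and the cache file are hypothetical names, and `echo`/`false` stand in for a real sensor query:

```shell
#!/bin/sh
# Hypothetical helper: run a sensor command; if it fails or prints nothing,
# fall back to the last value stored in the cache file.
# Usage: read_cached CACHE_FILE COMMAND [ARGS...]
read_cached() {
    cache_file=$1
    shift
    if value=$("$@" 2>/dev/null) && [ -n "$value" ]; then
        printf '%s\n' "$value" > "$cache_file"
    fi
    # Print the cached value (fresh or stale); fail if nothing was ever stored.
    [ -s "$cache_file" ] && cat "$cache_file"
}

cache=$(mktemp)
read_cached "$cache" echo 41.2   # sensor answers: value is stored and printed
read_cached "$cache" false       # sensor fails: previous value is printed again
rm -f "$cache"
```

To cover the timeout case, the sensor command could be wrapped in the coreutils `timeout` utility (for example `read_cached "$cache" timeout 2 some_sensor_command`, where `some_sensor_command` is a placeholder), so that a hanging query counts as a failed read.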

For now I'm done with this boring temperature stuff

tkaiser replied at Tue Nov 11, 2014 02:33
Conclusion:

After a few days of investigation for an enclosure for A20 boards together with a 3.5 H ...

Good conclusion. It is very useful for all users.

deenbee  
I followed the official steps to install from GitHub, but the CPU temperature does not work for me. I need help.

Escritorio 1_004.png

Any idea?

tkaiser  
Edited by tkaiser at Sat Nov 22, 2014 04:56
deenbee replied at Fri Nov 21, 2014 17:43
Any idea?


Seems like you just installed the original RPi-Monitor without the necessary modifications for A20-based boards like the Banana Pi. I would follow the step-by-step tutorial here and report back.

Edited by destroyedlolo at Sat Nov 22, 2014 14:34
tkaiser replied at Tue Nov 11, 2014 09:33
BTW: most users of an A20 board still believe the PMU's temperature value is 'CPU temp'


That's what the Banana wiki FAQ says (question #20).
It would be worth updating it.

Anyway, you made a very valuable contribution : thanks a lot.

FPeter  
Hi All!

I understand the conclusions above, but here is my clean solution instead of the "quick&ugly" to read TP temp of this SoC...

You can compile it as easy as "gcc sunxi_tp_temp.c -o sunxi_tp_temp"

Main program is really simple, further explanation is not required I think...

In addition, You can use it as a template to explore the whole world of the A20 register map, not only this temperature value! But beware, it's a really dangerous area! If You don't understand what to write and where, use only the read function at first! You can easily freeze Your BPi, and You can even cause filesystem corruption or hardware damage if You accidentally modify the wrong registers!

It's not a kernel module, but the motto is the same as in the programmer's Bible called "The Linux Kernel Module Programming Guide":

"You know C, you've written a few normal programs to run as processes, and now you want to get to where the real action is, to where a single wild pointer can wipe out your file system and a core dump means a reboot."


mmio_sunxitemp.zip

1.83 KB, Downloads: 118

tkaiser  
Edited by tkaiser at Mon Nov 24, 2014 10:16
FPeter replied at Mon Nov 24, 2014 09:52
Main program is really simple, further explanation is not required I think...


Thx! Your solution is far better than the current approach.

But line 24 should read
    mmio_write(0x01c25004, 0x00000090);
instead and it would be great if you could parse for command line arguments and in case it's 'f' or 'F' simply do a conversion to Fahrenheit. Then in Bananian the shell function soctemp that misuses syslog right now could be replaced with your binary!

Compare with https://dev.bananian.org/view.php?id=56 and http://pastebin.com/jNLfSS4U please
    root@bananas /tmp # ./mmio_sunxitemp && /usr/local/bin/soctemp
    41.2
    approx. 41.2°C
    root@bananas /tmp # ./mmio_sunxitemp && /usr/local/bin/soctemp f
    32.6
    approx. 90.68°F

FPeter  
here are some improvements:

- value of 0x01c25004 modified to 0x00000090
- calibration value is not a constant now, should be provided as 1st argument
- optional arguments:
        -c or -C for Celsius (default)
        -f or -F for Fahrenheit
        -k or -K for Kelvin
        -d or -D for debug mode

    root@bananapi:/home/pi/sunxi_tp_temp# ./sunxi_tp_temp 1447
    34.8
    root@bananapi:/home/pi/sunxi_tp_temp# ./sunxi_tp_temp 1447 -f
    95.0
    root@bananapi:/home/pi/sunxi_tp_temp# ./sunxi_tp_temp 1447 -K -d
    w 0x01c25000: 0027003f
    w 0x01c25010: 00040000
    w 0x01c25018: 00010fff
    w 0x01c25004: 00000090
    r 0x01c25020: 00000707
    308.35 Kelvin

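The debug run allows a quick plausibility check. The raw value read from 0x01c25020 is 0x707 (1799 decimal), and with the calibration value 1447 the printed 308.35 Kelvin is consistent with the conversion (raw - calibration) / 10 in degrees Celsius. That formula is inferred from the numbers in this post, not taken from the tool's source, and `tp_temp` below is a made-up helper name:

```shell
#!/bin/sh
# Assumed conversion, inferred from the debug output in this post (not from
# the tool's source): degrees Celsius = (raw - calibration) / 10.
# Usage: tp_temp RAW_HEX CALIBRATION [c|f|k]
tp_temp() {
    raw=$(printf '%d' "0x$1")   # register value as printed, e.g. 00000707
    awk -v raw="$raw" -v cal="$2" -v unit="${3:-c}" 'BEGIN {
        c = (raw - cal) / 10
        if (unit == "f")      printf "%.2f\n", c * 9 / 5 + 32
        else if (unit == "k") printf "%.2f\n", c + 273.15
        else                  printf "%.2f\n", c
    }'
}

tp_temp 00000707 1447      # Celsius
tp_temp 00000707 1447 f    # Fahrenheit
tp_temp 00000707 1447 k    # Kelvin, matches the 308.35 shown in the debug run
```

The same Fahrenheit formula also reproduces the soctemp pair quoted earlier in the thread: 32.6°C corresponds to 90.68°F.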
Do You have any other ideas?

sunxi_tp_temp.zip

2.17 KB, Downloads: 226
