Hello, I would like to ask for help since I am feeling a bit clueless here.
I recently bought a low-end dedicated machine that is supposed to host
some services: squid, proftpd and rtorrent.
I installed Debian lenny, immediately upgraded to squeeze and
configured the services. I started rtorrent, but after the machine
reaches heavy load (> 10 MBps network traffic, maxed-out CPU), it holds
for a while, then all network connections drop and I have to order a hard
reset in order to bring it back online.
I thought it was a misconfiguration issue, so I tried reconfiguring the
server and installing Ubuntu 10.04 on it, but I’m getting the same results.
I had a look at /var/log/kern.log and on Ubuntu I am seeing some
“Clocksource tsc unstable” messages right before the machine crashes.
I can see the same kind of messages on squeeze as well, just not as
close to the reboots as they were on Ubuntu. Google tells me they might
have something to do with CPU frequency scaling. There are loads of
reports by users like me who are experiencing random freezes. It seems,
though, that there is no clear answer: people solved the issue with video
card driver updates, replacing bad hardware, changing the frequency
scaling governor, and so on.
So far I have only played around with the frequency scaling governor: setting
it to “performance” seems to freeze the machine quicker than with the
Here are the CPU specs of the machine:
# cat /proc/cpuinfo
processor : 0
vendor_id : AuthenticAMD
cpu family : 15
model : 39
model name : AMD Athlon™ 64 Processor 3700+
stepping : 1
cpu MHz : 2200.000
cache size : 1024 KB
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 1
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx mmxext fxsr_opt
lm 3dnowext 3dnow up pni lahf_lm
bogomips : 4398.97
clflush size : 64
cache_alignment : 64
address sizes : 40 bits physical, 48 bits virtual
power management: ts fid vid ttp tm stc
# cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_governors
conservative userspace powersave ondemand performance
# cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_frequencies
1000000 1800000 2000000 2200000
# cat /sys/devices/system/clocksource/clocksource0/available_clocksource
I asked the datacenter support to perform a hardware check and they
tested the machine for 8 hours without errors.
Now, how can I find out what’s going on with this server? I’m pretty
sure it’s faulty hardware, but I have no proof to show to the datacenter.
I am currently running squeeze 2.6.32-5-686-bigmem. The machine has
1024 MB of RAM and 2x160 GB SATA HDDs. The NIC is a 100 Mbit Realtek one,
with proper drivers from the firmware-realtek Debian package.
I would love to have some opinions on how to deal with this.
Helmut Grohne [ Editor ]
There probably is no straightforward answer to this question. The first task obviously is to find out what is going wrong. In the cases I have experienced so far, it helped a lot to install a monitoring solution.
I can recommend munin (packages munin munin-node munin-plugins-extra). It collects data every five minutes and draws nice plots. It comes with plugins for monitoring temperatures, disks (SMART), and different aspects of load. Note that not all plugins are configured automatically.
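As a rough sketch of the setup, assuming a stock squeeze installation and a root shell, the packages named above can be installed and the plugins that are not enabled by default can be reviewed with munin-node-configure. The exact list of suggested plugins depends on the hardware, so treat the following as a starting point rather than a recipe:

```shell
# Run as root. Installs the packages named above.
apt-get install munin munin-node munin-plugins-extra

# List the plugins that munin thinks would work on this machine.
munin-node-configure --suggest

# Print the symlink commands for the suggested plugins and run them.
# Review the output of the first command before piping it to sh.
munin-node-configure --shell
munin-node-configure --shell | sh

# Restart the node so that the newly linked plugins are picked up.
/etc/init.d/munin-node restart
```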
Another tool suited to this task is collectd.
Once you have a monitoring solution installed, you should try to reproduce the crashes and look at the graphs. In most cases you will find an oddity that can serve as a starting point to dig further.
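Since the graphs are only updated every five minutes, the last moments before a freeze may be missing from them. A small logger can fill that gap. The following is only a sketch and not part of munin: the function names (read_or_na, snapshot, log_forever) and the log file location are made up for illustration, and it assumes the usual sysfs paths for the clocksource and cpufreq are present. Where a file is missing, it records "n/a" instead.

```shell
#!/bin/sh
# Print the first line of a file, or "n/a" if it cannot be read
# (for example when cpufreq is not available on the machine).
read_or_na() {
    if [ -r "$1" ]; then
        head -n 1 "$1"
    else
        echo "n/a"
    fi
}

# Emit one line: timestamp, load averages, clocksource, governor, frequency.
snapshot() {
    printf '%s load=%s clocksource=%s governor=%s khz=%s\n' \
        "$(date '+%Y-%m-%dT%H:%M:%S')" \
        "$(read_or_na /proc/loadavg | cut -d' ' -f1-3 | tr ' ' ',')" \
        "$(read_or_na /sys/devices/system/clocksource/clocksource0/current_clocksource)" \
        "$(read_or_na /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor)" \
        "$(read_or_na /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq)"
}

# Append a snapshot to a log file every N seconds (default 10) and
# flush it to disk, so the last entries have a chance to survive a
# hard reset.
log_forever() {
    logfile=$1
    interval=${2:-10}
    while :; do
        snapshot >> "$logfile"
        sync
        sleep "$interval"
    done
}
```

To use it, source the file and start the loop in the background before reproducing the load, for instance with `log_forever /root/crashwatch.log 10 &`. After the reset, the tail of that file shows the load, clocksource and governor shortly before the machine went down, which can be compared against the munin graphs and the kernel log.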