ARM v6 (Raspberry Pi) vs ARM v7 (BeagleBone)

Synopsis: ARMv6 CPUs are significantly slower than ARMv7 CPUs at the same clock speed, at least when running Debian (armhf) on them.

I know it's a rather naive benchmark, but I tried running a tiny Perl program to find all the prime numbers under 100,000 on both the Raspberry Pi and the BeagleBone.
They're both 700 MHz ARM CPUs, but the Raspberry runs on the older v6 spec CPU.
Surprisingly, this seems to make a huge difference to performance.

My desktop (3GHz i7) - 3.3 seconds
BeagleBone (720 MHz ARMv7) - 68 seconds
Raspberry Pi (700 MHz ARMv6) - 125 seconds

I thought I'd try it quickly in Scala, but it seems the JVM isn't very well optimised on ARM yet :(
So I tried it with just doing primes to 10,000 instead.

Desktop - 0.33 s
Beagle - 19 s (zero) / 34 s (jamvm)
Raspberry - 58 s (jamvm) / 79 s (zero)

It's curious to note that, for this naive benchmark, the best JVM varies between the architectures: Zero was a lot faster than JamVM on the Beagle, but a lot slower on the Raspberry Pi.

Does anyone know why that is, or if there's any way to get better JVM performance?
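
The Perl program itself was not posted, so the following is only a sketch of the kind of naive prime search described above, written in Python and assuming plain trial division. The function names `is_prime` and `primes_below` are made up for illustration, and absolute timings will not be comparable with the Perl figures.

```python
import math
import time


def is_prime(n):
    """Trial division by 2 and then by odd candidates up to sqrt(n)."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    for d in range(3, math.isqrt(n) + 1, 2):
        if n % d == 0:
            return False
    return True


def primes_below(limit):
    """Return every prime p with 2 <= p < limit."""
    return [n for n in range(2, limit) if is_prime(n)]


if __name__ == "__main__":
    # Same two problem sizes as in the post: 10,000 and 100,000.
    for limit in (10_000, 100_000):
        start = time.perf_counter()
        found = primes_below(limit)
        elapsed = time.perf_counter() - start
        print(f"{len(found)} primes below {limit} in {elapsed:.2f} s")
```

The inner loop is dominated by the `%` operation, so a benchmark of this shape mostly measures how quickly the interpreter and the CPU get through integer remainders.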

Toby Corkindale writes:
> I know it's a rather naive benchmark, but I tried running a tiny Perl program to find all the prime numbers under 100,000 on both the Raspberry Pi and the BeagleBone.
FWIW I use this synthetic benchmark:

    http://homepages.cwi.nl/~steven/dry.c

On a TF101 (tegra2), with cc (Ubuntu/Linaro 4.6.1-9ubuntu3) 4.6.1, I get

    cc -DPASS2 -O dry.c dry1.o -o dry2o
    [...]
    Trying 50000000 runs through Dhrystone:
    Microseconds for one run through Dhrystone: 0.4
    Dhrystones per Second: 2596054

On an i7-870, with gcc-4.4.real (Ubuntu 4.4.3-4ubuntu5) 4.4.3, I get

    Trying 500000000 runs through Dhrystone:
    Microseconds for one run through Dhrystone: 0.0
    Dhrystones per Second: 25163562

IIUC the cool kids use SPECint (not synthetic) but you gotta pay for it.

    https://en.wikipedia.org/wiki/SPECint
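
As a rough aid to reading the two Dhrystone results above, here is a small Python sketch that converts the reported "Dhrystones per Second" figures into time per run and into DMIPS. It assumes the usual convention of dividing by 1757 (the VAX 11/780 score), and the helper names are illustrative only. The two runs used different compilers, and the i7 compile line was not shown, so the ratio is indicative at best.

```python
# Conventional reference score: 1757 Dhrystones/s on a VAX 11/780 equals 1 DMIPS.
VAX_11_780_DHRYSTONES_PER_SEC = 1757


def microseconds_per_run(dhrystones_per_sec):
    """Time for one pass through Dhrystone, in microseconds."""
    return 1_000_000 / dhrystones_per_sec


def dmips(dhrystones_per_sec):
    """Dhrystone MIPS relative to the VAX 11/780 reference."""
    return dhrystones_per_sec / VAX_11_780_DHRYSTONES_PER_SEC


# Scores copied from the output quoted above.
results = {"TF101 (tegra2)": 2596054, "i7-870": 25163562}

for name, score in results.items():
    print(f"{name}: {microseconds_per_run(score):.3f} us/run, {dmips(score):.0f} DMIPS")

ratio = results["i7-870"] / results["TF101 (tegra2)"]
print(f"i7-870 vs TF101: {ratio:.1f}x")
```

This also suggests why the i7 output reads "0.0": the per-run time is well under 0.05 microseconds, so it rounds to zero when printed with one decimal place.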
> They're both 700 MHz ARM CPUs, but the Raspberry runs on the older v6 spec CPU. Surprisingly, this seems to make a huge difference to performance.
Equivalent to comparing a 3GHz Pentium III and a 3GHz Pentium 4. You're running Debian armhf on both, and while that *supports* v6 (unlike Ubuntu arm/armhf), it may still be optimized for v7. Generating benchmark numbers is easy, interpreting them is hard :-)
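
One simple way to start interpreting the Perl timings is to normalise them by clock speed, since the two boards are not clocked identically. The Python sketch below does only that arithmetic, using the clock speeds and run times from the original post. The names are illustrative, and it deliberately ignores memory, cache and compiler differences.

```python
# Figures from the original post: (clock in MHz, Perl run time in seconds).
results = {
    "BeagleBone (ARMv7)": (720, 68.0),
    "Raspberry Pi (ARMv6)": (700, 125.0),
}


def megacycles(clock_mhz, seconds):
    """Clock cycles spent on the run, in millions (MHz * s)."""
    return clock_mhz * seconds


def per_clock_slowdown(slow, fast):
    """How many times more cycles `slow` needed than `fast`."""
    return megacycles(*slow) / megacycles(*fast)


rpi = results["Raspberry Pi (ARMv6)"]
bone = results["BeagleBone (ARMv7)"]

print(f"wall-clock ratio: {rpi[1] / bone[1]:.2f}x")
print(f"per-clock ratio:  {per_clock_slowdown(rpi, bone):.2f}x")
```

The small clock difference accounts for very little of the gap, so most of it has to come from the cores, the memory systems, or how the binaries were built.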
> I thought I'd try it quickly in Scala, but it seems the JVM isn't very well optimised on ARM yet :(
ARMv6 implements some JVM bytecodes directly in hardware. FOSS JVMs cannot use them.

    https://en.wikipedia.org/wiki/Jazelle

I'm a bit hazy on the current state of play WRT ARMv7.

Re "how do I make it go faster", all of the usual funroll-loops.org discussion applies.

On 21/11/12 14:12, Trent W. Buck wrote:
>> I thought I'd try it quickly in Scala, but it seems the JVM isn't very well optimised on ARM yet :(
>
> ARMv6 implements some JVM bytecodes directly in hardware. FOSS JVMs cannot use them.
> https://en.wikipedia.org/wiki/Jazelle
> I'm a bit hazy on the current state of play WRT. ARMv7.
>
> Re "how do I make it go faster", all of the usual funroll-loops.org discussion applies.
According to the Wikipedia page, Jazelle is deprecated and these days is only supported at a trivial level that provides no acceleration. Its successor, ThumbEE, was announced in 2005, but it in turn appears to have been deprecated as of 2011. So it's no great loss (now) that we don't have any FOSS JVM support for Jazelle, I think?

It seems that the ARMv6 doesn't actually have any divide instruction in the CPU, whereas some ARMv7 variations do get it.
So it's possible my naive benchmark was picking on a specific weakness of the Raspberry Pi's CPU.

On 21/11/12 13:03, Toby Corkindale wrote:
> Synopsis: ARM v6 CPUs are significantly slower than ARMv7 CPUs at the same clock speed, at least when running Debian (armhf) on them.
> I know it's a rather naive benchmark, but I tried running a tiny Perl program to find all the prime numbers under 100,000 on both the Raspberry Pi and the BeagleBone.
> They're both 700 MHz ARM CPUs, but the Raspberry runs on the older v6 spec CPU.
> Surprisingly, this seems to make a huge difference to performance.
> My desktop (3GHz i7) - 3.3 seconds
> BeagleBone (720 MHz ARMv7) - 68 seconds
> Raspberry Pi (700 MHz ARMv6) - 125 seconds

On 21/11/12 14:46, Toby Corkindale wrote:
> It seems that the ARMv6 doesn't actually have any divide instruction in the CPU, whereas some ARMv7 variations do get it.
> So it's possible my naive benchmark was picking on a specific weakness of the raspberry pi's CPU.
Nah, actually it seems the particular variant of the ARMv7 in the BeagleBone doesn't have a divide instruction either. (Or if it does, gcc isn't compiling to use it.)
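
For context, when a core has no integer divide instruction, compilers targeting the ARM EABI typically emit a call to a runtime helper such as `__aeabi_idiv` or `__aeabi_uidivmod` instead. The Python sketch below is not that routine (the real helpers are optimised assembly). It is only the textbook shift-and-subtract algorithm, with a hypothetical name `soft_udiv`, to show why each division costs many more steps than a single instruction.

```python
def soft_udiv(numerator, denominator, bits=32):
    """Unsigned restoring division, producing one quotient bit per iteration.

    Assumes 0 <= numerator < 2**bits and denominator > 0.
    Returns (quotient, remainder).
    """
    if denominator == 0:
        raise ZeroDivisionError("division by zero")
    quotient = 0
    remainder = 0
    # Walk the numerator from its most significant bit down to bit 0.
    for i in reversed(range(bits)):
        # Bring down the next bit of the numerator.
        remainder = (remainder << 1) | ((numerator >> i) & 1)
        # If the divisor fits, subtract it and set this quotient bit.
        if remainder >= denominator:
            remainder -= denominator
            quotient |= 1 << i
    return quotient, remainder


if __name__ == "__main__":
    print(soft_udiv(97, 7))  # same result as divmod(97, 7)
```

A prime search by trial division performs a remainder operation in its innermost loop, so any per-division overhead of this kind is multiplied many times over, on both boards if neither has the instruction.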
participants (2)
- Toby Corkindale
- trentbuck@gmail.com