Posted by: he (无情的雨), Board: Embedded_system
Title: linux for ppc chapter 19
Posting site: 哈工大紫丁香 (Monday, 2001-06-04 11:10:35), on-site message

----------------------------------------------------------------------------
19. Performance
19.1 CPU core
Cache
First, make sure you have both the I and D caches enabled.
Also, make sure you have serialization disabled (set ICTRL to 0x7).
To get maximum performance, you need to enable the copyback data cache.
Copyback is sometimes left disabled so that the standard Linux/PPC libraries
work without recompiling. If you build your own glibc as described under
Runtime Library, you can enable copyback. Look for a "make config" option, or
grep for DC_SFWT in
arch/ppc/kernel/head.S
and change the
#if 0
to
#if 1
BogoMIPS
The BogoMIPS value on 8xx processors should be within 1% or so of the actual
CPU core frequency, allowing for rounding and minor timing calculation
errors. This makes it a useful sanity check that the internal clock
multiplier is set correctly and that the I-cache is turned on. Note, however,
that the BogoMIPS calculation is still tied to the external clock source and
internal prescaler settings, so it shouldn't be relied on alone to verify
that the core frequency really is what you think it is. A simple cross-check
is to run 'sleep 10' at the shell prompt and time it with a watch to check
that you're at least in the ballpark. It's wise to measure your system more
accurately than this with an oscilloscope (CRO) at least once.
Also, beware that the BogoMIPS rating should not be used as a general CPU
performance measure; see: http://linuxdoc.org/HOWTO/mini/BogoMips.html
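The 1% sanity check described above can be scripted. The following is only a
minimal sketch: the function names are made up for illustration, and it
assumes the target reports a line of the form "bogomips : 49.86" in
/proc/cpuinfo (the exact field spelling may differ on your kernel, so the
match is case-insensitive).

```python
import re


def parse_bogomips(cpuinfo_text):
    """Return the first bogomips value in /proc/cpuinfo-style text, or None.

    Assumption: the value appears on a line like "bogomips : 49.86".
    """
    match = re.search(r"^bogomips\s*:\s*([0-9.]+)", cpuinfo_text,
                      re.IGNORECASE | re.MULTILINE)
    return float(match.group(1)) if match else None


def bogomips_matches_clock(cpuinfo_text, expected_mhz, tolerance=0.01):
    """True if BogoMIPS is within `tolerance` (default 1%) of expected_mhz."""
    value = parse_bogomips(cpuinfo_text)
    if value is None:
        raise ValueError("no bogomips line found")
    return abs(value - expected_mhz) <= tolerance * expected_mhz
```

On the target this might be called as
bogomips_matches_clock(open("/proc/cpuinfo").read(), 50.0). A passing check
does not replace the stopwatch or oscilloscope measurement, for the reasons
given above.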
19.2 Profiling
There are numerous options available for system profiling, depending on what
you wish to measure, and how invasive you are prepared to be.
/proc/profile
/proc/profile is a standard kernel feature which provides simple kernel
profiling based on Instruction Pointer sampling in the periodic timer
interrupt routine. It's simplistic but effective, and has low overhead since
the interrupt is going to happen anyway. The data is processed with
readprofile, which looks up System.map to show which kernel functions are
using the most CPU time. It doesn't work for modules yet, so at present you
need to compile them into the kernel to profile them.
You need to enable this at boot time by passing profile=2 on the command
line. The number gives the power-of-2 granularity used for the counters: 2
gives you a separate counter for each PowerPC instruction (each 4 bytes).
Higher numbers consume less memory and give less precise results. The data
from /proc/profile will be in target byte order, so if you're
cross-developing you may need to either byte swap it, or compile readprofile
to run on your target.
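To make the granularity and byte-order points concrete, here is a small
host-side sketch. It is an illustration only: the function names are
hypothetical, and it assumes the dump is a flat array of 32-bit unsigned
counters in big-endian (PowerPC target) order. Check the readprofile source
for the exact file layout on your kernel before relying on it.

```python
import struct


def counter_count(text_size, shift):
    """Number of counters needed to cover text_size bytes at profile=shift."""
    return text_size >> shift


def bucket_index(address, text_start, shift):
    """Index of the counter that a sampled instruction address falls into."""
    return (address - text_start) >> shift


def decode_counters(raw, target_byte_order=">"):
    """Decode raw counter bytes copied from the target into host integers.

    Assumption: 32-bit unsigned counters; ">" means big-endian (PowerPC),
    "<" would mean little-endian. Trailing partial words are ignored.
    """
    count = len(raw) // 4
    return list(struct.unpack(f"{target_byte_order}{count}I", raw[:count * 4]))
```

With profile=2, two adjacent instructions land in adjacent counters; with
profile=4, four consecutive instructions share one counter and the table is a
quarter of the size.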
The PowerPC branch of the Linux kernel has been slow to implement the
Instruction Pointer sampling function necessary to generate the /proc/profile
data. If it isn't implemented in your kernel, you'll see that readprofile
always shows zero time for every kernel function. In this case you need to
apply the profile.patch from: http://members.xoom.com/greyhams/linux/patches/
Linux Trace Toolkit
http://www.opersys.com/LTT
The Linux Trace Toolkit works with an instrumented Linux kernel by saving
time-stamped records of important kernel events to a binary data file. A data
decoder converts the binary data to text and calculates statistical
summaries, such as percent processor utilization by each process. The toolkit
also includes an integrated environment that graphically displays the results
and provides search capability.
A version for embedded PowerPC targets is now available from:
ftp://ftp.mvista.com/pub/LTT.
gprof
All the usual Linux user mode profiling tools like gprof are available.
kernprof
http://oss.sgi.com/projects/kernprof
This project aims to make full gprof profiling available for the kernel.
However, it hasn't been ported to the PowerPC architecture yet.
19.3 IDMA
Beware that IDMA on the 860 is not designed for high performance, and the
CPU gets better throughput with explicit cache-bursted programmed I/O. Search
for IDMA for more discussion.
Confusion sometimes arises because DMA transfers in most systems are faster
than CPU transfers, whereas here the reverse is generally true. Furthermore,
IDMA transfers eat into CPM processing time, limiting the throughput of other
communications modules running at the same time.
19.4 Network
To get good TCP/IP performance, you need a fast CPU. Using the FEC, a 50 MHz
860P will run about 30 Mbits/sec TCP/IP, and a 100 MHz 860P will run about
60 Mbits/sec TCP/IP. The bottleneck is the protocol and application
processing in the PPC core. The performance of a TCP/IP connection scales
nearly linearly with the processor speed.
If you need to go faster, use the 8260.
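The two figures quoted above work out to roughly 0.6 Mbits/sec per MHz of
core clock. As a rough illustration of the linear-scaling claim, the sketch
below extrapolates from the 50 MHz data point. The function name is
hypothetical, and the result is only a ballpark estimate for an 860P using
the FEC; it is not a measurement.

```python
# Reference point taken from the text: 50 MHz 860P, about 30 Mbits/sec.
REFERENCE_MHZ = 50.0
REFERENCE_MBITS = 30.0


def estimate_tcp_mbits(core_mhz):
    """Rough TCP/IP throughput estimate, assuming purely linear scaling."""
    return core_mhz * REFERENCE_MBITS / REFERENCE_MHZ
```

For example, estimate_tcp_mbits(100) reproduces the 60 Mbits/sec figure
quoted for the 100 MHz part; real results will depend on the protocol and
application load.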
19.5 Optimisation
Optimising everything for space using gcc's -Os option is likely to provide
both the smallest code size and the best performance, because it inhibits
loop unrolling, which tends to have a negative effect on embedded processors
with relatively small caches. Furthermore, PowerPC processors can
speculatively execute branches overlapped with other loop instructions,
making the branch effectively execute in zero cycles, so loop unrolling is
unnecessary in many circumstances.
----------------------------------------------------------------------------
--

※ Source: 哈工大紫丁香 bbs.hit.edu.cn [FROM: 202.118.235.250]
