Friday, January 12, 2007
Embedded Design for Performance (MIPS)
An embedded system waits for external events to occur and then performs
some action based on those events without user interaction. The main design
considerations are generally performance, footprint, and power.
1. Keep the architecture as simple as possible, but no simpler
Below are architectures listed by their level of complexity.
A. Polling Loop
The CPU loops and polls all the external inputs. It's feasible to write
all or most of this in assembly. The worst-case latency for servicing any
input is the time for one full pass through the loop (see the first sketch
after this list). This approach is power hungry, so it's no good for
anything running on batteries.
B. Interrupt Driven with Single Loop
ISRs handle the external events, leaving the main loop free to do data
crunching. Locking becomes an issue here. The latency for any interrupt is
the processing time for any higher (possibly same) priority interrupts,
plus any lockouts by the main loop or other ISRs. It's still feasible to
write most or all of this in assembly. The loop can poll or wait for an
event; for power-sensitive applications, the loop should wait (see the
second sketch after this list). This applies to the more complex
architectures, too.
C. Single Thread
Basically the same as B, but you have a stack and an RTOS running. The
RTOS probably supports multiple threads and is just chewing up extra
clock cycles. One benefit of an RTOS is that it should come with tools.
Most code will be written in a higher-level language such as C.
D. Multiple Threads
Similar to the single-thread model, plus scheduling overhead and more
synchronization issues.
E. Multiple Processes
Each process has its own memory space, so you pay extra TLB overhead.
The scheduling algorithm is probably more complex and thus adds to the
scheduling overhead. Sharing resources between processes is generally
much more expensive.
F. SMP Processors
These are CPUs with multiple cores. The cores share memory and cache.
The architecture will most likely be multiple threads or processes.
Synchronization between cores is more expensive.
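First, a minimal sketch of the polling-loop architecture (A). The device
registers, addresses, bit positions, and handler names are hypothetical,
chosen only to illustrate the shape of the loop.

#include <stdint.h>

/* Hypothetical memory-mapped status registers; the addresses and bit
 * positions are made up for illustration. */
#define UART_STATUS    (*(volatile uint32_t *)0xBF000004u)
#define UART_RX_READY  (1u << 0)
#define GPIO_STATUS    (*(volatile uint32_t *)0xBF000010u)
#define GPIO_CHANGED   (1u << 2)

static void handle_uart_rx(void) { /* drain the RX FIFO */ }
static void handle_gpio(void)    { /* react to the pin change */ }

int main(void)
{
    for (;;) {                          /* poll forever, never sleep */
        if (UART_STATUS & UART_RX_READY)
            handle_uart_rx();
        if (GPIO_STATUS & GPIO_CHANGED)
            handle_gpio();
    }
}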
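Second, a sketch of the interrupt-driven architecture (B) with a waiting
loop. The ISR registration is assumed to happen elsewhere, and a real loop
must close the race between testing the flag and executing the MIPS wait
instruction, for example by disabling interrupts around the test.

#include <stdint.h>

static volatile uint32_t rx_pending;   /* set by the ISR, cleared by the loop */

/* Hypothetical UART receive ISR; assume it is registered with the
 * interrupt controller elsewhere.  Keep the ISR's work minimal. */
void uart_rx_isr(void)
{
    rx_pending = 1;
}

int main(void)
{
    for (;;) {
        while (!rx_pending)
            __asm__ volatile ("wait");  /* sleep the core until an interrupt */
        rx_pending = 0;
        /* ...crunch the received data here... */
    }
}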
2. Design around the Cache Flow
To minimize cache misses, design modules and data based on when they'll
be used by the CPU.
2.1 Modularization
Modularization is breaking a large piece of code into smaller, manageable
pieces or modules. A module could be a file, library, object, or a
section in a file. Modules are usually designed around a hardware
component or an abstract object. For better performance, they should
instead be designed around completing a task.
Example of a typical serial driver design:
1. A method for the user to send and receive data
2. Software queuing code
3. Hardware manipulation routines
Example of a serial driver designed around cache flow:
1. Transmit Flow
a. A method for the user to send data
b. TX queuing code
c. Hardware manipulation routines for transmission
2. Receive Flow
a. Hardware manipulation routines for retrieving data
b. RX queuing code
c. A method for the user to receive data
The latter design will reduce cache misses.
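One way to realize this layout with gcc is to tag everything on the
transmit path so the linker places it contiguously, keeping one stream of
execution within a few cache lines. This is a sketch: the section name
and function names are made up, and a matching linker script entry may be
needed.

/* Co-locate the whole transmit path in one text section (gcc). */
#define TX_PATH __attribute__((section(".text.tx_path")))

TX_PATH static void uart_tx_hw_kick(void)
{
    /* poke the (hypothetical) TX FIFO register here */
}

TX_PATH static int uart_tx_enqueue(const char *buf, int len)
{
    /* copy into the software TX queue; returns bytes queued */
    return len;
}

TX_PATH int uart_send(const char *buf, int len)   /* the user-facing method */
{
    int n = uart_tx_enqueue(buf, len);
    uart_tx_hw_kick();
    return n;
}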
2.2 Cache Friendly Data
Group data that are likely to be used together on the same cache line.
Check your particular CPU to see what the cache line size is.
Example of how to do this for a CPU with an 8-byte cache line:

#include <stdint.h>

struct cache_friendly_s {
    /* cache line 8 bytes - cnt 0 */
    int32_t a;
    int32_t b;
    /* cache line 8 bytes - cnt 1 */
    char    c[4];
    int32_t d;
};
Notice the comments also include a count. This information is useful for
prefetching, which will be covered later.
To help you count properly, make sure each data member is CPU aligned.
If the CPU is 32 bits, allocate int16 members in pairs and chars in
groups of four. There are two reasons for this. First, it improves
performance when all data is CPU aligned. Second, if you don't, the
compiler will likely do it for you by adding filler padding.
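For example, on a 32-bit CPU the following sketch keeps every member
naturally aligned, so the compiler has no reason to insert padding
(member names are illustrative):

#include <stdint.h>

struct aligned_s {
    int16_t x;       /* two int16s fill one 32-bit word */
    int16_t y;
    char    tag[4];  /* four chars fill the next word */
    int32_t value;   /* already word aligned, no padding needed */
};
/* sizeof(struct aligned_s) == 12 on a typical 32-bit ABI */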
2.3 Don't Wait, Prefetch
Prefetching data means having the CPU load some memory into cache that
will be needed by an instruction in the future. This is not guaranteed to
improve performance, and will actually hurt performance if the prefetched
data is seldom used. It's a good idea to add prefetching only after the
code has been instrumented, so you can gauge how effective it is. You can
write inline assembly to prefetch or use __builtin_prefetch() in gcc.
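A sketch of __builtin_prefetch() on a linked list, fetching the next node
while the current one is processed. The prefetch distance of one node is
a guess that should be tuned with profiling:

struct node {
    struct node *next;
    long         value;
};

long sum_nodes(const struct node *n)
{
    long total = 0;
    while (n) {
        if (n->next)
            __builtin_prefetch(n->next, 0, 1);  /* read, low temporal locality */
        total += n->value;
        n = n->next;
    }
    return total;
}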
It's also possible to prefetch hardware registers.
a. Reserve a general purpose register to prefetch with. Using gcc on a
MIPS core, the compiler option -ffixed-t9 will reserve general purpose
register t9 for the user to manipulate.
b. Write inline assembly code to load the hardware register into the
general purpose register.
c. Write inline assembly to access that general purpose register.
The concept to grasp here is that the load instruction (b) will not
block the pipeline. Accessing the register (c) will block only if (b)
has not finished loading the hardware register.
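A sketch of steps (a) through (c), assuming every file is compiled with
-ffixed-t9 so the compiler never touches t9. The UART status address is
hypothetical, placed in kseg1 so the access is uncached:

#include <stdint.h>

#define UART_STATUS_ADDR 0xBF000004u   /* hypothetical MMIO address in kseg1 */

/* (b) start the load into the reserved register; the pipeline keeps going.
 * No t9 clobber is listed because -ffixed-t9 keeps the compiler off it. */
static inline void prefetch_uart_status(void)
{
    __asm__ volatile ("lw $t9, 0(%0)"
                      :
                      : "r" (UART_STATUS_ADDR)
                      : "memory");
}

/* (c) consume the value; stalls only if the load has not completed yet */
static inline uint32_t read_uart_status(void)
{
    uint32_t status;
    __asm__ volatile ("move %0, $t9" : "=r" (status));
    return status;
}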
Beware:
a. Ask yourself if it's OK to prefetch this register early.
b. Don't make any library calls that are not compiled with the same
-ffixed-xx option.
c. Consider disabling all interrupts between the prefetch and actually
accessing the register, because any file not compiled with the same
-ffixed-xx option may clobber this register, including your RTOS's
scheduler and ISR handlers.
d. Any inline assembly or assembly code will disregard this compiler
option.