Experimental Physics and
Industrial Control System

Andrew Johnson <[email protected]> · Wed, 16 Aug 2006 17:22:37 -0500

Till Straumann wrote:

On Thu, 2006-08-10 at 14:58 -0500, Andrew Johnson wrote:
Interrupts may not be as quick at actually getting to the CPU as a
Target Abort - I don't know whether modern CPUs finish off any/all
instructions that they've already started running before they actually
switch to processing the exception, but it's likely that there will be
a number of instructions pending. This also supposes that interrupts
are enabled at the time the bus error gets flagged.

Yes, the latter is true. However, VME access is so slow that having
interrupts disabled around longer manipulations is not a good idea
anyways.

It is sometimes impossible to write code that has to manipulate the
interrupt registers of a VME slave card without disabling interrupts to
the CPU.

Note that the 'machine check' generated by the target abort is also
just an external interrupt line. I can't see how that differs much
from using EE. On board designs using the Universe, the target abort
is generated by the host bridge and propagated via the MCP or TSA
line to the CPU, and is therefore also inherently asynchronous to
instruction execution.

The Machine Check exception generated by the Target Abort is synchronous
with the termination of the read cycle that caused the VME bus error,
and it is thus possible to determine the instruction that caused the
fault. For example, on an MVME2700 (Universe-2) with my BSP:

    mv2700> d 0xf0000000
    f0000000:
    VME Bus Error accessing A24: 0x000000
    machine check
    Exception next instruction address: 0x001ba5e0
    Machine Status Register: 0x0008b030
    Condition Register: 0x20004084
    Task: 0x1d3b1d0 "tShell"

A disassembly shows the exception instruction:

    mv2700> l 0x001ba5d0
    0x1ba5d0  3c60001f    lis         r3,0x1f # 31
    0x1ba5d4  3ba10030    addi        r29,r1,0x30 # 48
    0x1ba5d8  386302b4    addi        r3,r3,0x2b4 # 692
    0x1ba5dc  a0090000    lhz         r0,0(r9)
    0x1ba5e0  901e0004    stw         r0,4(r30)
    0x1ba5e4  93010030    stw         r24,48(r1)
    0x1ba5e8  a09e0006    lhz         r4,6(r30)
    0x1ba5ec  4cc63182    crxor       crb6,crb6,crb6
    0x1ba5f0  4bfe1b61    bl          0x19c150 # printf

The instruction at 0x001ba5dc is the lhz instruction that tried to read
the location at A24:000000.
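
As a rough illustration of how the trace above is read (a sketch only, not
code from the BSP): PowerPC instructions are a fixed four bytes long, so in
this trace the faulting load is the word just before the reported "next
instruction address", and the fields of the lhz encoding can be unpacked
with a few shifts. The function and struct names here are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Fields of a PowerPC D-form instruction such as lhz:
 * primary opcode, target register, base register, signed displacement. */
struct dform {
    unsigned opcode;
    unsigned rd;
    unsigned ra;
    int16_t  disp;
};

/* In the trace above the faulting load is the instruction immediately
 * before the reported "next instruction address". */
uint32_t faulting_pc(uint32_t next_pc)
{
    assert((next_pc & 3u) == 0u);   /* instructions are word aligned */
    return next_pc - 4u;
}

/* Split a 32-bit D-form instruction word into its fields. */
struct dform decode_dform(uint32_t insn)
{
    struct dform f;
    f.opcode = (insn >> 26) & 0x3fu;        /* top 6 bits */
    f.rd     = (insn >> 21) & 0x1fu;        /* next 5 bits */
    f.ra     = (insn >> 16) & 0x1fu;        /* next 5 bits */
    f.disp   = (int16_t)(insn & 0xffffu);   /* low 16 bits, signed */
    return f;
}
```

Applied to the trace, faulting_pc(0x001ba5e0) gives 0x001ba5dc, and
decode_dform(0xa0090000) gives primary opcode 40 (lhz), target r0, base r9
and displacement 0, which matches the disassembly line "lhz r0,0(r9)".
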
If the Bus Error occurs inside an interrupt service routine,

I consider this a fatal, nonrecoverable error.

I also consider it a pretty fatal error, but I want my hardware and OS
to be able to tell me where it was when the problem occurred so I can
quickly figure out what actually happened.

you're invariably gonna see more of this as CPUs get faster ;-)

Actually CPUs have pretty much stopped getting faster nowadays (although
the highest speeds haven't filtered through to the VME world yet); we're
just putting them in parallel to achieve speedups now...

In any case, IMO, a bus error should be considered a serious error that
must be avoided (except for 'probing' during initialization) because of
the significant latencies that can be introduced by a VME bus timeout.

I'm not disputing that we should avoid bus errors, but they are a fact
of life in a failing VME system. Unfortunately the Tempe chip's flawed
design makes the system's response to one much less than ideal, given
that the Target Abort mechanism is available on the PCIbus and Tundra
have already managed to implement the necessary circuitry to use it in
the Universe-2 chip.

Of course, write operations are completely asynchronous, and in that
case the only thing that can be done is to report that an error
happened; there is no way to relate it to a particular task/PC.
Note that this is also true for the Universe (with write-posting
enabled).

I am less concerned about write posting (I enable this myself) and even
bus errors from write cycles, since they don't directly affect the
operation of the running task and will almost always be surrounded by
read cycles anyway so a card that develops a fault will soon signal its
problem by faulting a read operation.

What I object to is the completion of a failing read cycle with an
all-1s bit pattern, because this can and probably will break any existing
device drivers. In the past a driver was guaranteed that a bus error on
a read cycle would stop it immediately at the read instruction and thus
prevent further operation, whereas now drivers will have to be very
defensive about all the data they read from the VMEbus.

That's not going to be good for performance or portability, especially
where all-1s is a valid bit pattern from a register that must be read
inside an ISR (how can the ISR tell whether the value it read was real
or not? The only way to find out is to ask the Tempe chip, so the code
is no longer portable).
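
To make that cost concrete, here is a minimal sketch of the kind of
defensive read a driver would need. The bridge status word and its
BRIDGE_VME_ERR bit are hypothetical stand-ins, not the Tsi148's actual
register layout, and clearing the error latch is left out entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical "VME bus error latched" bit in a bridge status word. */
#define BRIDGE_VME_ERR 0x1u

/* Read a 16-bit VME register. An all-1s result is ambiguous, so in that
 * case only, ask the bridge whether a bus error was recorded.
 * Returns false (leaving *value untouched) if the read looks failed. */
bool vme_read16_checked(const volatile uint16_t *reg,
                        const volatile uint32_t *bridge_status,
                        uint16_t *value)
{
    assert(reg != NULL && bridge_status != NULL && value != NULL);

    uint16_t v = *reg;
    if (v == 0xffffu && (*bridge_status & BRIDGE_VME_ERR) != 0u)
        return false;   /* all-1s plus a latched error: not real data */

    *value = v;
    return true;
}
```

Even in this simplified form every caller has to handle a failure return,
and the check itself depends on a bridge-specific status word, which is
exactly the loss of portability described above.
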
However, in contrast to the Universe, write posting cannot be disabled
on the Tsi148 and that introduces problems with VME ISRs:

...

The only remedy here is reading something back from
the device prior to letting the ISR return (reading anything
flushes the Tsi148's write-FIFO).

This is actually something that all VME ISRs should be doing anyway,
since even the VMEchip2 (as used on the MVME167 et al) implemented write
posting.
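
A minimal sketch of that read-back pattern, assuming a hypothetical card
with a status register and an interrupt-clear register (the struct layout
and names are invented for illustration):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical register layout of a VME slave card. */
struct card_regs {
    volatile uint16_t status;
    volatile uint16_t int_clear;
};

/* ISR body: clear the card's interrupt, then read something back from
 * the card before returning, so that a posted write cannot still be
 * queued in the bridge when the ISR exits. */
uint16_t card_isr(struct card_regs *regs)
{
    assert(regs != NULL);           /* sanity check for the sketch */

    regs->int_clear = 1u;           /* this write may be posted */
    return regs->status;            /* read-back from the same card */
}
```

On real hardware regs would point into the mapped VME window. Depending
on how that window is mapped, an ordering barrier between the write and
the read may also be needed; that is omitted here.
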
=> IMO, the Tsi148's new features (fast 2eVME and SST transfers among
   others) outweigh the disadvantage that write-posting cannot be
   disabled. I don't share your negative assessment and recommendation
   to stay away from 6100s.

If you need the new features and speed then you'll probably be willing
to recode any existing drivers or just accept that random things may
happen in the event that some card fails. For operational sites like
the APS with 224 different types of VME card used in our IOCs,
revisiting all our device drivers isn't something we want to have to do...

- Andrew
--
Not everything that can be counted counts,
and not everything that counts can be counted.
  -- Albert Einstein
