Experimental Physics and
Industrial Control System

"Ernest L. Williams Jr." <[email protected]> · Wed, 16 Aug 2006 18:42:10 -0400

On Wed, 2006-08-16 at 17:22 -0500, Andrew Johnson wrote:
> Till Straumann wrote:
> > On Thu, 2006-08-10 at 14:58 -0500, Andrew Johnson wrote:
> > 
> >> Interrupts may not be as quick at actually getting to the CPU as a 
> >> Target Abort - I don't know whether modern CPUs finish off any/all 
> >> instructions that they've already started running before they actually 
> >> switch to processing the exception, but it's likely that there will be 
> >> a number of instructions pending.  This also supposes that interrupts 
> >> are enabled at the time the bus error gets flagged.
> > 
> > Yes, the latter is true. However, VME access is so slow that having
> > interrupts disabled around longer manipulations is not a good idea
> > anyway.
> 
> It is sometimes impossible to write code that has to manipulate the 
> interrupt registers of a VME slave card without disabling interrupts to 
> the CPU.
> 
> > Note that the 'machine check' generated by the target abort is also
> > just an external interrupt line. I can't see how that differs much
> > from using EE. On board designs using the Universe, the target abort
> > is generated by the host bridge and propagated via the MCP or TSA
> > line to the CPU, and is therefore also inherently asynchronous to
> > instruction execution.
> 
> The Machine Check exception generated by the Target Abort is synchronous 
> with the termination of the read cycle that caused the VME bus error, 
> and it is thus possible to determine the instruction that caused the 
> fault.  For example, on an MVME2700 (Universe-2) with my BSP:
> 
> mv2700> d 0xf0000000
> f0000000:
> VME Bus Error accessing A24: 0x000000
> machine check
> Exception next instruction address: 0x001ba5e0
> Machine Status Register: 0x0008b030
> Condition Register: 0x20004084
> Task: 0x1d3b1d0 "tShell"
> 
> A disassembly shows the exception instruction:
> 
> mv2700> l 0x001ba5d0
> 0x1ba5d0  3c60001f    lis         r3,0x1f # 31
> 0x1ba5d4  3ba10030    addi        r29,r1,0x30 # 48
> 0x1ba5d8  386302b4    addi        r3,r3,0x2b4 # 692
> 0x1ba5dc  a0090000    lhz         r0,0(r9)
> 0x1ba5e0  901e0004    stw         r0,4(r30)
> 0x1ba5e4  93010030    stw         r24,48(r1)
> 0x1ba5e8  a09e0006    lhz         r4,6(r30)
> 0x1ba5ec  4cc63182    crxor       crb6,crb6,crb6
> 0x1ba5f0  4bfe1b61    bl          0x19c150 # printf
> 
> The instruction at 0x001ba5dc, immediately before the reported next 
> instruction address, is the lhz instruction that tried to read the 
> location at A24:000000.
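
As a rough illustration of the arithmetic in the trace above: PowerPC instructions are a fixed 4 bytes wide, so when the reported "next instruction address" is the one directly after the faulting load, the faulting instruction sits 4 bytes earlier. The helper name below is hypothetical and is not part of any BSP.

```c
#include <stdint.h>

/* PowerPC instructions are a fixed 4 bytes wide. */
#define PPC_INSN_BYTES 4u

/*
 * Hypothetical helper: given the "next instruction address" from a
 * synchronous machine check report, return the address of the
 * instruction just before it. This assumes the report points at the
 * instruction immediately after the faulting load, as in the trace.
 */
static uint32_t faulting_insn_addr(uint32_t next_addr)
{
    return next_addr - PPC_INSN_BYTES;
}
```

For the trace above, faulting_insn_addr(0x001ba5e0) gives 0x001ba5dc, which is the lhz in the disassembly.
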
> 
> >> If the Bus Error occurs inside an interrupt service routine,
> > 
> > I consider this a fatal, nonrecoverable error.
> 
> I also consider it a pretty fatal error, but I want my hardware and OS 
> to be able to tell me where it was when the problem occurred so I can 
> quickly figure out what actually happened.
> 
> > you're invariably gonna see more of this as CPUs get faster ;-)
> 
> Actually CPUs have pretty much stopped getting faster nowadays (although 
> the highest speeds haven't filtered through to the VME world yet); we're 
> just putting them in parallel to achieve speedups now...
> 
> > In any case, IMO, a bus error should be considered a serious error that
> > must be avoided (except for 'probing' during initialization)
> > because of the significant latencies that can be introduced
> > by a VME bus timeout.
> 
> I'm not disputing that we should avoid bus errors, but they are a fact 
> of life in a failing VME system.  Unfortunately the Tempe chip's flawed 
> design makes the system's response to one much less than ideal, given 
> that the Target Abort mechanism is available on the PCIbus and Tundra 
> have already managed to implement the necessary circuitry to use it in 
> the Universe-2 chip.
> 
> > Of course, write operations are completely asynchronous
> > and in that case, the only thing that can be done is reporting
> > that an error happened but there is no way to relate it
> > to a particular task/PC.
> > 
> > Note that this is also true for the Universe (with write-posting
> > enabled).
> 
> I am less concerned about write posting (I enable this myself) and even 
> bus errors from write cycles, since they don't directly affect the 
> operation of the running task. Write cycles will almost always be 
> surrounded by read cycles anyway, so a card that develops a fault will 
> soon signal its problem by faulting a read operation.
> 
> What I object to is the completion of a failing read cycle with an 
> all-1s bit pattern, because this can and probably will break existing 
> device drivers.  In the past a driver was guaranteed that a bus error on 
> a read cycle would stop it immediately at the read instruction and thus 
> prevent further operation, whereas now drivers will have to be very 
> defensive about all the data they read from the VMEbus.
> 
> That's not going to be good for performance or portability, especially 
> where all-1s is a valid bit pattern from a register that must be read 
> inside an ISR.  How can the ISR tell whether the value it read was real 
> or not?  The only way to find out is to ask the Tempe chip, so the code 
> is no longer portable.
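
A minimal sketch of the kind of defensive read described above, run against a simulated bridge rather than real hardware. Everything here (struct sim_bridge, sim_read16, checked_read16, the error_latched flag) is hypothetical and does not reflect the actual Tsi148 register layout; it only shows why an all-1s value has to be cross-checked against bridge-specific state.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a VME bridge; NOT the Tsi148 register map. */
struct sim_bridge {
    uint16_t reg_value;     /* what a healthy card would return */
    bool     card_failed;   /* simulate a card that no longer responds */
    bool     error_latched; /* bridge-side record of a failed read */
};

/* A failed read completes with all-1s and only latches a status flag. */
static uint16_t sim_read16(struct sim_bridge *b)
{
    if (b->card_failed) {
        b->error_latched = true;
        return 0xFFFFu;
    }
    return b->reg_value;
}

/*
 * Defensive read: stores the value and returns true only if the read
 * really completed. An all-1s result is ambiguous on its own, so the
 * bridge's error flag has to be consulted (and acknowledged).
 */
static bool checked_read16(struct sim_bridge *b, uint16_t *out)
{
    uint16_t v = sim_read16(b);

    if (v == 0xFFFFu && b->error_latched) {
        b->error_latched = false; /* acknowledge the latched error */
        return false;
    }
    *out = v;
    return true;
}
```

Note that a healthy register holding 0xFFFF still reads back as valid here; the cost is the extra bridge-specific check on every such read.
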
> 
> > However, in contrast to the Universe, write posting cannot be disabled
> > on the Tsi148 and that introduces problems with VME ISRs: 
> ...
> > The only remedy here is reading something back from
> > the device prior to letting the ISR return (reading anything
> > flushes the Tsi148's write-FIFO)
> 
> This is actually something that all VME ISRs should be doing anyway, 
> since even the VMEchip2 (as used on the MVME167 et al) implemented write 
> posting.
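
The read-back idea can be sketched with a toy model of a posted-write FIFO sitting in front of a single device register. All names (struct sim_dev, posted_write, flushing_read, isr) are hypothetical, and the FIFO behaviour is a simplification assumed for illustration, not a description of any real bridge.

```c
#include <stdint.h>

#define FIFO_DEPTH 4

/* Hypothetical model: one device register behind a posted-write FIFO. */
struct sim_dev {
    uint16_t irq_status;       /* device register; nonzero = IRQ asserted */
    uint16_t fifo[FIFO_DEPTH]; /* writes accepted but not yet delivered */
    int      fifo_len;
};

/* Deliver every pending write to the device, oldest first. */
static void drain(struct sim_dev *d)
{
    for (int i = 0; i < d->fifo_len; i++)
        d->irq_status = d->fifo[i];
    d->fifo_len = 0;
}

/* A posted write returns at once; the device sees it only later. */
static void posted_write(struct sim_dev *d, uint16_t value)
{
    if (d->fifo_len == FIFO_DEPTH)
        drain(d); /* model simplification: a full FIFO drains first */
    d->fifo[d->fifo_len++] = value;
}

/* Any read forces the pending writes out before it completes. */
static uint16_t flushing_read(struct sim_dev *d)
{
    drain(d);
    return d->irq_status;
}

/* ISR pattern: clear the interrupt, then read back before returning. */
static void isr(struct sim_dev *d)
{
    posted_write(d, 0);     /* may still be sitting in the FIFO */
    (void)flushing_read(d); /* read-back guarantees it has landed */
}
```

Without the read-back, the ISR could return while the clearing write is still queued, and the interrupt would still appear asserted.
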
> 
> > => IMO, the Tsi148's new features (fast 2eVME and SST transfers
> >    among others) outweigh the disadvantage that write-posting
> >    cannot be disabled. I don't share your negative assessment and
> >    recommendation to stay away from 6100s.
> 
> If you need the new features and speed then you'll probably be willing 
> to recode any existing drivers or just accept that random things may 
> happen in the event that some card fails.  For operational sites like 
> the APS with 224 different types of VME card used in our IOCs, 
> revisiting all our device drivers isn't something we want to have to do...

We are concerned about the VMEbus issues that you raised.
The MVME2100 will reach End-of-Life (EOL) soon, and we are counting on the
MVME3100 and MVME6100 as its successors.

So I have filed a technical concern with both Motorola and Wind River.
Of course, this will lead back to Tundra, but we need to get it resolved if
we want to move forward with a reliable system.

I will post the results back here, hopefully in the near future.

Thanks,
Ernest
SNS Control Systems Group
ORNL

> 
> - Andrew
