The quest for an interesting bug

Discussion:

John Rumm

2024-10-07 14:51:51 UTC

Only tenuous links to DIY for this one, but anyone with an interest in
electronics, high power broadcast kit and software may find this tale of
hunting for a mysterious "software" bug enjoyable:

https://wiki.diyfaq.org.uk/index.php/Hardware_and_software_fault_finding_-_an_interesting_bug

Let me know if it needs more explanatory stuff to de-geek the technical
bits...
--
Cheers,

John.

/=================================================================\
| Internode Ltd - http://www.internode.co.uk |
|-----------------------------------------------------------------|
| John Rumm - john(at)internode(dot)co(dot)uk |
\=================================================================/

Alan J. Wylie

2024-10-07 17:24:00 UTC

Post by John Rumm
Only tenuous links to DIY for this one, but anyone with an interest in
electronics, high power broadcast kit and software may find this tale
https://wiki.diyfaq.org.uk/index.php/Hardware_and_software_fault_finding_-_an_interesting_bug
Let me know if it needs more explanatory stuff to de-geek the
technical bits...

That brings back so many memories of sitting in an air-conditioned
cubicle (polythene sheeting on a frame of 2 by 4s) next to a multi-wheel
crankshaft journal grinder in Ford No. 2 Engine plant in Cleveland,
Ohio. I, too, had an In-Circuit Emulator: an Intel ICE-85 8085 with an
8" floppy.

Bugs in the software could have "interesting" consequences: the advance
of the grinding wheels was entirely under microprocessor
control. Slamming it forward too hard, or even a failure of the motor
that rotated the workpiece could result in the crankshaft spinning so
fast it broke out of its head/tailstock, shattering against the floor
and the broken parts bouncing away at very high velocity.

See
https://www.newall.org.uk/files/2017/03/Crankshaft-Grinders-Flyer.pdf

--
Alan J. Wylie https://www.wylie.me.uk/

Dance like no-one's watching. / Encrypt like everyone is.
Security is inversely proportional to convenience

John Rumm

2024-10-07 20:14:56 UTC

Post by Alan J. Wylie

I have used a few different Intel ICE models - the first was a "state of
the art" (for the time) 80386 one... then later an 80186 version, and
finally an 8051 version. Getting ever more feeble with each iteration :-)

Less popular than they used to be from a software debug point of view -
probably because making one fast enough for modern chips is hard, and
the built-in debug capabilities of modern CPUs are so much more advanced
than they used to be.

Post by Alan J. Wylie
Bugs in the software could have "interesting" consequences: the advance
of the grinding wheels was entirely under microprocessor
control. Slamming it forward too hard, or even a failure of the motor
that rotated the workpiece could result in the crankshaft spinning so
fast it broke out of its head/tailstock, shattering against the floor
and the broken parts bouncing away at very high velocity.

Yup sounds like fun (not!)

(hardware interlocks are almost always a good thing)

Post by Alan J. Wylie
See
https://www.newall.org.uk/files/2017/03/Crankshaft-Grinders-Flyer.pdf

--
Cheers,

John.

Fredxx

2024-10-07 21:29:58 UTC

Post by John Rumm
Only tenuous links to DIY for this one, but anyone with an interest in
electronics, high power broadcast kit and software may find this tale of
https://wiki.diyfaq.org.uk/index.php/Hardware_and_software_fault_finding_-_an_interesting_bug
Let me know if it needs more explanatory stuff to de-geek the technical
bits...

Couldn't you have disabled interrupts during the critical 25us window?

John Rumm

2024-10-08 09:54:21 UTC

Post by Fredxx

Post by John Rumm
Only tenuous links to DIY for this one, but anyone with an interest in
electronics, high power broadcast kit and software may find this tale
https://wiki.diyfaq.org.uk/index.php/Hardware_and_software_fault_finding_-_an_interesting_bug
Let me know if it needs more explanatory stuff to de-geek the
technical bits...

Couldn't you have disabled interrupts during the critical 25us window?

I think the answer to that is "Yes, but"....

By the time I had figured out there was an interaction with interrupt
handling, I had actually found the cause of the problem anyway, so there
was no point in doing a workaround at that point.

Disabling interrupts may have had other knock-on effects. Plus the whole
"leaving a hardware design error in there" bit is sure to bite someone
sooner or later.

(had the new CPU board hardware already been in service in multiple
customer sites round the world, then you might argue that a software
only "patch" might be preferable, but the software could only be updated
by a physical EPROM swap - so that would mean an engineer visit anyway,
at which point the mod wire patch would be easy enough to do)
--
Cheers,

John.

Fredxx

2024-10-08 13:11:29 UTC

Post by John Rumm

Post by Fredxx

Post by John Rumm
Only tenuous links to DIY for this one, but anyone with an interest
in electronics, high power broadcast kit and software may find this
https://wiki.diyfaq.org.uk/index.php/Hardware_and_software_fault_finding_-_an_interesting_bug
Let me know if it needs more explanatory stuff to de-geek the
technical bits...

Couldn't you have disabled interrupts during the critical 25us window?

I suppose as I sit in both camps, a few lines of extra code, versus
scalpel, soldering iron and wire, a few lines of code win every time.

Obviously a risk analysis of the knock on effect would be required.

Post by John Rumm
(had the new CPU board hardware already been in service in multiple
customer sites round the world, then you might argue that a software
only "patch" might be preferable, but the software could only be updated
by a physical EPROM swap - so that would mean an engineer visit anyway,
at which point the mod wire patch would be easy enough to do)

Software updates are always easier to sell than "we made a mistake and
need to hack away at your board"!

It's not uncommon to disable interrupts to allow for atomic-like code
executions.

John Rumm

2024-10-08 16:45:04 UTC

Post by Fredxx
Couldn't you have disabled interrupts during the critical 25us window?

I suppose as I sit in both camps, a few lines of extra code, versus
scalpel, soldering iron and wire, a few lines of code win every time.

With an understanding of the actual failure mode, the safest bodge
would probably have been setting up a reserved area of address space in
ROM that was not available for code, and locating that on the address
space window that shared a numeric equivalence with the IO device's
location in IO address space. That would mean that during an interrupt,
you would then *know* you would never be interrupting code running at
any of the "danger" addresses.

Post by Fredxx
Obviously a risk analysis of the knock on effect would be required.

That is where it gets messy - you would probably need to identify all
peripherals mapped into IO space, look at how "tight" the address
decoding logic for each was (if you are not short of IO space, then
it is not uncommon to only partially decode the peripheral so that it
gets a bigger space than it really needs - but that also tends to mean
that its control registers are then "echoed" several times at different
locations), and block those out as well. It might mean you then run out
of available ROM space or available contiguous ROM space for the code!
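
A rough sketch of how those "echoes" could be enumerated so that the
matching ROM addresses can be reserved. It is in Python purely for
illustration; the 16-bit address space, the register address 0x0300 and
the decode mask 0x0FFF are made-up values, not those of the board in
question, and alias_addresses is a hypothetical helper.

```python
def alias_addresses(base, decoded_mask, address_bits=16):
    """Return every address at which a partially decoded register responds.

    Only the bits set in decoded_mask are compared by the (assumed)
    decode logic. All other bits are "don't care", so each combination
    of them produces another echo of the same register.
    """
    # Bit positions the decoder ignores.
    free_bits = [b for b in range(address_bits)
                 if not (decoded_mask >> b) & 1]
    # The part of the address the decoder actually checks.
    fixed = base & decoded_mask

    aliases = []
    for combo in range(1 << len(free_bits)):
        addr = fixed
        for i, bit in enumerate(free_bits):
            if (combo >> i) & 1:
                addr |= 1 << bit
        aliases.append(addr)
    return sorted(aliases)


# Assumed example: only the low 12 address bits are decoded.
danger = alias_addresses(0x0300, 0x0FFF)
print([hex(a) for a in danger])
```

With those assumed values the register shows up at 16 locations
(0x0300, 0x1300 ... 0xF300), each of which would need blocking out of
the code space.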

Post by Fredxx

Post by John Rumm
(had the new CPU board hardware already been in service in multiple
customer sites round the world, then you might argue that a software
only "patch" might be preferable, but the software could only be
updated by a physical EPROM swap - so that would mean an engineer
visit anyway, at which point the mod wire patch would be easy enough
to do)

Software updates are always easier to sell than "we made a mistake and
need to hack away at your board"!

I think in real life it would be easier to just swap out the whole PCB
for one with the latest software and all hardware mods in place rather
than risk on-site mods or chip changes (especially with the production
boards, where the ROM might be soldered in and the board have conformal
coating)

Still, fortunately, this was all during hardware / software integration
- one of the points of which is not only to get the thing working, but
also to fix stuff and remove as many unwanted "features" as possible
*before* it gets into production or is delivered to a customer.

Post by Fredxx
It's not uncommon to disable interrupts to allow for atomic-like code
executions.

Indeed - although that is not usually because your hardware is prone to
writing to random IO addresses :-)
--
Cheers,

John.

Alan J. Wylie

2024-10-08 21:55:10 UTC

Post by John Rumm

Post by Fredxx
It's not uncommon to disable interrupts to allow for atomic-like
code executions.

Indeed - although that is not usually because your hardware is prone
to writing to random IO addresses :-)

Referring back to my earlier post about grinding machines. The 8085 had
core store (non-volatile randomly addressable memory). One of the things
I discovered (though it could have been a Heisenbug caused by the
presence of the ICE), when looking on the In Circuit Emulator at the
instruction backtrace after the power had been turned off on the main
machine, was that random instructions were being executed as the power
ramped down.

Since the calibrated zero position of the wheels was stored in core, and
since there was a non-zero possibility of it being corrupted during a
power down, we had to add extra code to ensure that it was verified on
power up to prevent the wheelhead from randomly accelerating.
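
As an illustration only of that kind of power-up check (the post does
not say how the verification was actually done, and the original would
have been 8085 code, not Python): one common approach is to store a
checksum alongside the value and refuse to use it if the two disagree.
The CRC-32, the 8-byte record layout and the function names below are
all assumptions.

```python
import struct
import zlib


def pack_calibration(zero_position):
    """Build a record: 4-byte signed position followed by its CRC-32."""
    payload = struct.pack("<i", zero_position)
    return payload + struct.pack("<I", zlib.crc32(payload))


def load_calibration(record):
    """Return the stored zero position, or None if the record is corrupt."""
    if len(record) != 8:
        return None
    payload = record[:4]
    stored_crc = struct.unpack("<I", record[4:])[0]
    if zlib.crc32(payload) != stored_crc:
        # Corrupted (e.g. by stray writes during power-down):
        # the caller must recalibrate rather than move the wheelhead.
        return None
    return struct.unpack("<i", payload)[0]


record = pack_calibration(12345)
print(load_calibration(record))                    # intact record
damaged = bytes([record[0] ^ 0x01]) + record[1:]   # flip one stored bit
print(load_calibration(damaged))                   # rejected
```

On a rejected record the safe response is to force a recalibration
before any wheelhead movement is allowed.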

--
Alan J. Wylie https://www.wylie.me.uk/

John Rumm

2024-10-08 22:55:48 UTC

Post by Alan J. Wylie

Post by John Rumm

Post by Fredxx
It's not uncommon to disable interrupts to allow for atomic-like
code executions.

Indeed - although that is not usually because your hardware is prone
to writing to random IO addresses :-)

That is where the "dying gasp" capability of some modern
microcontrollers is quite handy - they can detect power failure and
execute proper "tidy up" routines using the remaining juice in the
capacitors.

(which actually reminds me of another requirement on that transmitter.
The big vacuum tuning caps would "clunk" back to their closest plate
spacing when power was removed, since the servo motor was no longer
holding them in position. I remember saying to the engineer who specced
the new requirements something like "I am surprised you don't want us to
implement a capacitor 'parking' capability to park them gently". The
moment I said that, I thought "me and my big mouth!" as it was obvious
that he thought it was a good idea and added it to the requirements.

Turned out to be a bit of a PITA to implement. It was easy to add extra
parking steps to the end of the amp "off" state machine to do a graceful
park during normal shutdown, but it got tricky to avoid parking in the
various special cases where you could not do it, such as when handling
operating exceptions like the arc detector triggering or abnormal VSWR
being detected)

Post by Alan J. Wylie
Since the calibrated zero position of the wheels was stored in core, and
since there was a non-zero possibility of it being corrupted during a
power down, we had to add extra code to ensure that it was verified on
power up to prevent the wheelhead from randomly accelerating.

Yup, you don't want a rapid unplanned disassembly just because someone
turned it on and off again!
--
Cheers,

John.

Chris J Dixon

2024-10-09 08:02:04 UTC

Post by John Rumm
With an understanding of the actual failure mode, the safest bodge
would probably have been setting up a reserved area of address space in
ROM that was not available for code, and locating that on the address
space window that shared a numeric equivalence with the IO device's
location in IO address space. That would mean that during an interrupt,
you would then *know* you would never be interrupting code running at
any of the "danger" addresses.

I recall a strange behaviour that colleagues had to deal with on
a train's passenger information display.

It simply had a set of stored messages, externally triggered,
which scrolled across, when called.

Seemingly randomly, it would display gibberish. Eventually it was
noticed that this tended to happen at certain times of day.

Further investigation finally discovered that in the circuitry
that searched in the specified memory address for the chosen
message, there was a component that had light sensitivity. In the
right conditions it was corrupting the address to look in unused,
or unavailable, locations.

Chris

--
Chris J Dixon Nottingham UK
***@cdixon.me.uk @ChrisJDixon1

Plant amazing Acers.

Bob Eager

2024-10-08 20:33:23 UTC

Post by Fredxx
Couldn't you have disabled interrupts during the critical 25us window?

I suppose as I sit in both camps, a few lines of extra code, versus
scalpel, soldering iron and wire, a few lines of code win every time.

I won't repeat it here, but here's a case where I had no choice but to use
(micro)code.

http://www.bobeager.uk/anecdotes.html#hwhack

--
My posts are my copyright and if @diy_forums or Home Owners' Hub
wish to copy them they can pay me £1 a message.
Use the BIG mirror service in the UK: http://www.mirrorservice.org
*lightning surge protection* - a w_tom conductor

Vir Campestris

2024-10-15 20:55:22 UTC

I've had lots of interesting bugs in my time, but one is safely history.

We were getting reports from end users that just occasionally, for no
apparent reason, the system was reporting a disc timeout. And it wouldn't
recover; it had to be rebooted.

We took some test systems and fed them all sorts of horrible workloads,
and never ever EVER saw the problem. Lots of systems working really hard
for weeks on end. This went on for quite a while.

In the end one of our field engineers won himself a promotion - he found
out how to make it happen reasonably often, and it was fixed within a week.

It turned out that what the code was doing was firing off an IO request
to the disc then setting a timer in case the disc never came back. When
the request did come back it turned off the timer, and all was well.

In that description is the problem. Can you see it?

What happened was that the transfer was requested, but before the timer
was started the clock ticked and another task got to run. It ran long
enough that the disc transfer had completed before the first task got
control back, and only then did the first task start the timer. The
completion had already come and gone, so that timer was left running
with nothing to cancel it. If no other transfers were run before the
timer ran out then the disc error flag was set, and the _next_ transfer
failed.

The fix was to start the timer before the transfer. Slightly slower
maybe, but reliable.
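
For anyone who wants to see the ordering problem spelled out, here is a
toy replay of it in Python. It is only a sketch: the event names and the
simulate function are invented for illustration and bear no relation to
the real driver code, which is long gone.

```python
def simulate(events):
    """Replay driver events in order; return True if a false timeout is flagged."""
    timer_armed = False
    error_flag = False
    for event in events:
        if event == "start_timer":
            timer_armed = True
        elif event == "transfer_complete":
            # The completion handler turns the timer off.
            timer_armed = False
        elif event == "timer_expires":
            if timer_armed:
                error_flag = True
                timer_armed = False
        # "request_transfer" and "preempted" change no state in this model.
    return error_flag


# Original ordering, no task switch: works fine.
BUGGY_NORMAL = ["request_transfer", "start_timer",
                "transfer_complete", "timer_expires"]

# Original ordering, but another task runs between request and timer start.
BUGGY_PREEMPTED = ["request_transfer", "preempted", "transfer_complete",
                   "start_timer", "timer_expires"]

# Fixed ordering: timer first, so the completion always cancels it.
FIXED_PREEMPTED = ["start_timer", "request_transfer", "preempted",
                   "transfer_complete", "timer_expires"]

for name, seq in [("buggy, normal", BUGGY_NORMAL),
                  ("buggy, preempted", BUGGY_PREEMPTED),
                  ("fixed, preempted", FIXED_PREEMPTED)]:
    print(name, "-> false timeout:", simulate(seq))
```

Only the preempted run of the original ordering flags a timeout, which
is why weeks of heavy test workloads could sail past it.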

Andy