.: pingswept.org :.

May 01, 2012

Delayed by memory timing errors

(This is a pretty technical post, so if that's not your bag, the summary is that there was a problem with the last round of boards, so I'm making a new version that works. Back to the story.)

When I got two new Rascal prototypes back from the assembler last month, I was pretty excited. The new version, Rascal 1.1, cleans up a bunch of little annoyances and adds new features. The previous batch had sold out quickly and, even better, the Rascal was living up to my vision of a relatively easy-to-use board for connecting stuff to the internet.

After some testing, all the new features-- USB ports, the new JTAG port, and the like-- appeared to work well. Unfortunately, there was a problem with the Rascal's RAM, which meant that when the Rascal tried to write data to certain addresses, the data was lost. If that data turned out to be code that was later executed, the Rascal would choke and reset when it tried to execute whatever random data happened to be there before the failed write.

This all seems straightforward in retrospect. In futuro-spect, it took me about a week to narrow down the problem to failing RAM writes.

Figuring out the cause

After watching the board reset a few times, I tried logging the boot messages to see when resets occurred. Some tedious tallying yielded this table of last boot messages from the Linux kernel before reset occurred.

```bash
Reps  Message

   1  Uncompressing Linux... done, booting the kernel.
   4  RPC: Registered tcp NFSv4.1 backchannel transport module.
   2  INIT: version 2.86 booting
   1  mount: mounting /dev/mmcblk0p1 on /mnt/root failed: . . .
   4  Setting up IP spoofing protection: rp_filter.
   2  Lease of 192.168.10.190 obtained, lease time 43200
   1  udevd (403): /proc/403/oom_adj is deprecated . . .
   1  * Starting Avahi mDNS/DNS-SD Daemon: avahi-daemon...done.
   2  Populating dev cache
   1  INIT: Entering runlevel: 5
  14  rascal14 login:
```

My fellow Artisan's Asylum denizen, Edison, noticed that the resets seemed to only occur after U-boot passed control to the Linux kernel. He suggested that we try making the board reset under U-boot. Looking through the list of U-boot commands, the memory test command, mtest, seemed like it might generate a decent load on the processor, so we gave it a try.

As it turns out, mtest revealed that the processor resets are probably caused by memory write errors in the RAM on the board. The idea is that the kernel gets written to RAM incorrectly when it is copied out of the serial flash. Faced with an invalid instruction in RAM, the processor does the only thing it can do, which is reset itself. Similar failures occurred on both new boards but did not occur with an older board.

The memory errors reported by U-boot looked like this:

```bash
Pattern 0000001E  Writing...  Reading...
Pattern FFFFFFE1  Writing...  Reading...
Pattern 0000001F  Writing...  Reading...
Pattern FFFFFFE0  Writing...  Reading...
Mem error @ 0x2124A33C: found 0049D711, expected FFB6D711
Mem error @ 0x2124A340: found 0049D710, expected FFB6D710
```

There's a pattern to the errors: the two lower bytes are correct, but the two upper bytes are the inverse of what they should be (0x0049 instead of 0xFFB6, and note that 0xFFB6 + 0x0049 = 0xFFFF). This pattern appeared most of the time. Sometimes, all four bytes of the found value were exactly the inverse of the expected value.
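To make that pattern concrete, here is a small Python sketch that splits each word from the error log into 16-bit halves and checks whether each half matches, is bit-inverted, or is something else. The function name `inverted_halves` is my own invention for illustration; it is not part of U-boot.

```python
def inverted_halves(found: int, expected: int) -> dict:
    """Classify each 16-bit half of a 32-bit word as ok, inverted, or other."""
    result = {}
    for name, shift in (("upper", 16), ("lower", 0)):
        f = (found >> shift) & 0xFFFF
        e = (expected >> shift) & 0xFFFF
        if f == e:
            result[name] = "ok"
        elif f == (~e & 0xFFFF):
            result[name] = "inverted"  # every bit flipped relative to expected
        else:
            result[name] = "other"
    return result


# The two errors from the mtest log above.
for found, expected in ((0x0049D711, 0xFFB6D711), (0x0049D710, 0xFFB6D710)):
    print(f"{found:08X} vs {expected:08X}:", inverted_halves(found, expected))
```

Both logged errors come out with the upper half inverted and the lower half intact.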

The mtest program writes to every RAM address alternating values from each end of the range of 32 bits, i.e. this sequence: 0xFFFFFFFF, 0x00000000, 0xFFFFFFFE, 0x00000001, . . ., but incrementing the value written by 1 for each memory location in the address space. The memory test code is on Github.

The fact that the data read back is the inverse of what we expect, with errors along byte boundaries, rather than random corruption, suggests that the problem is a timing issue, rather than a data or address problem. It looks like two bytes are failing to be written from time to time, so we read back whatever was written during the previous cycle of the memory test, which explains the inverse values. This leads to looking at signal integrity.
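The stale-data explanation can be checked with a toy model. The sketch below is not U-boot's mtest code; it is a simplified Python simulation that assumes one pass counts up from a pattern, the next pass counts down from the inverse pattern, and a "failed" write loses only the upper 16 bits. The names `run_pass` and `drop_upper_at` are hypothetical.

```python
MASK32 = 0xFFFFFFFF


def run_pass(memory, pattern, step, drop_upper_at=()):
    """Write pattern + i*step to each word, then read back and list mismatches.

    Words whose index is in drop_upper_at lose the write to their upper
    16 bits, so those bits keep whatever the previous pass left there.
    """
    expected = [(pattern + i * step) & MASK32 for i in range(len(memory))]
    for i, value in enumerate(expected):
        if i in drop_upper_at:
            memory[i] = (memory[i] & 0xFFFF0000) | (value & 0x0000FFFF)
        else:
            memory[i] = value
    return [(i, memory[i], expected[i])
            for i in range(len(memory)) if memory[i] != expected[i]]


memory = [0] * 8
# Pass 1: count up from 0x0000001F; every write succeeds.
first_errors = run_pass(memory, 0x0000001F, +1)
# Pass 2: the inverse pattern, counting down, with one failed upper-half write.
errors = run_pass(memory, 0xFFFFFFE0, -1, drop_upper_at={3})
for index, found, expected in errors:
    print(f"word {index}: found {found:08X}, expected {expected:08X}")
```

In this model the failed word keeps the upper half from the previous pass, which is the bit-inverse of the upper half it was supposed to receive, while the lower half is correct. That is the same shape as the errors in the log.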

This matter called "signal integrity"

"Signal integrity" means making sure that the digital pulses on a circuit board pass between chips without distortion in time or voltage. Sharp-edged pulses, as used in most digital communication, get rounded off because PCB traces have a little bit of capacitance and a little bit of resistance, which together make a low-pass filter. For most signals, this is not a problem, but when you send pulses that are faster than around 100 MHz over distances of more than an inch or so, you have to start being careful. With slow signals, if the sharp edges of your signals get rounded off for a few nanoseconds, you don't care. On the other hand, if you're signaling at 100 MHz, your signals are only 10 nanoseconds long, so you can't afford a few nanoseconds of sagging. Debugging this kind of problem is made more difficult because the average oscilloscope is too slow to capture 100 MHz signals accurately; all the sharp corners get rounded off whether your pulses are getting distorted or not.

There are more problems beyond getting sharp transitions. As signals propagate down PCB traces, they can be reflected back wherever the impedance of the trace changes, just like a wave of water reflects off the side of a tub. These reflections settle out in a few nanoseconds, but that's still a problem for high frequency signals.

Also, you want all of your signals to arrive at the same time. Electrical signals propagate through copper at around half the speed of light in a vacuum, or around 6 inches per nanosecond. This means that for two traces that differ by an inch in length, you get a timing error of around 0.17 ns.
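As a worked version of that arithmetic, here is a short Python sketch. It assumes the rough 6 inches per nanosecond figure above; the real speed depends on the board material, and the function names are my own.

```python
SPEED_IN_PER_NS = 6.0  # approximate propagation speed quoted above


def skew_ns(length_difference_in, speed_in_per_ns=SPEED_IN_PER_NS):
    """Arrival-time difference in ns between two traces of unequal length."""
    return length_difference_in / speed_in_per_ns


def period_ns(frequency_mhz):
    """Length of one clock period in ns."""
    return 1000.0 / frequency_mhz


for mismatch_in in (0.5, 1.0, 2.0):
    print(f"{mismatch_in} in mismatch -> {skew_ns(mismatch_in):.2f} ns "
          f"out of a {period_ns(100):.1f} ns period at 100 MHz")
```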

Signal integrity on the Rascal

Here's a cross-section of the Rascal circuit board. There are 4 layers of copper separated by 3 layers of fiberglass.

The thickness of the green layer (which is actually yellow in real life) means that a 5 mil wide trace has a characteristic impedance of 80-85 ohms. In theory, I could match this impedance with termination resistors near the memory chips to ensure that signals don't reflect in nasty ways. I decided not to add them at this point, for two reasons: the memory errors look like timing problems rather than noise from reflections on a few data lines, and more parts mean higher costs.

My friend Michael has an extremely fast oscilloscope. Using his scope, I was able to take a look at the signals on some of the memory control lines. Here's what they look like. (Apologies for the poor photo-- I wasn't intending to show it to the world.)

The whole screen shows two signals for 1/20,000,000th of a second. The green line is the 133 MHz memory clock pulse. The yellow line is the write-enable signal, which pulses low to signal that a write is taking place.

So what the hell does this picture mean? A rough summary is, "That yellow line looks like a noisy mess." In more precise terms, after the green signal crosses the minimum logical high voltage of 2.0 V (AKA V_IH,MIN), we need the yellow signal (write enable) to stay below the maximum legal logical low voltage, 0.8 V (V_IL,MAX) for 0.8 ns. You can't tell exactly from this picture, but the yellow signal rises right when the required hold time elapses. In theory, it should work, but given how noisy the signal is, it seems likely that it might slip to the wrong side of the line some of the time.
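If the scope traces were exported as sampled data, the same hold-time check could be done numerically. The Python sketch below uses the thresholds from the paragraph above (2.0 V, 0.8 V, 0.8 ns), but the waveform samples are invented for illustration and the function names are hypothetical. It also assumes write enable is still low when the clock edge arrives.

```python
V_IH_MIN = 2.0   # volts: clock counts as logically high above this
V_IL_MAX = 0.8   # volts: write enable counts as logically low below this
T_HOLD_NS = 0.8  # required hold time after the clock edge


def crossing_time(times, volts, threshold):
    """Linearly interpolated time of the first upward crossing, or None."""
    for k in range(1, len(times)):
        v0, v1 = volts[k - 1], volts[k]
        if v0 < threshold <= v1:
            fraction = (threshold - v0) / (v1 - v0)
            return times[k - 1] + fraction * (times[k] - times[k - 1])
    return None


def hold_margin_ns(times, clock, write_enable):
    """Measured hold time minus required hold time; negative is a violation."""
    t_clock = crossing_time(times, clock, V_IH_MIN)
    if t_clock is None:
        return None  # no rising clock edge in this capture
    # Start one sample before the clock edge so no interval is skipped.
    start = next(k for k, t in enumerate(times) if t >= t_clock)
    start = max(start - 1, 0)
    t_rise = crossing_time(times[start:], write_enable[start:], V_IL_MAX)
    if t_rise is None:
        return None  # write enable never rose in this capture
    return (t_rise - t_clock) - T_HOLD_NS


# Made-up samples, 0.25 ns apart; not measurements from the Rascal.
times = [0.25 * k for k in range(13)]
clock = [0.0, 0.5, 1.5, 2.5] + [3.3] * 9
write_enable = [0.2] * 6 + [0.4, 1.2, 2.4, 3.3, 3.3, 3.3, 3.3]

margin = hold_margin_ns(times, clock, write_enable)
print(f"hold margin: {margin:.2f} ns")
```

A margin near zero corresponds to the situation in the photo: the signal technically meets the requirement, but any noise can push it over the line.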

How do we fix it?

To make the memory timing work right, I wanted to do two things at the same time-- I wanted to make sure all the signals arrive simultaneously and that they're not jammed so close together that currents in one trace induce noise in adjacent traces.

To make the traces the same length, I delved into the fetid depths of the Altium API via Jscript. After Altium emitted the length of each leg, I added them together with a quick Python script. (I'll include the code at the bottom of the post.) This gave me a CSV file that allowed me to calculate the average and standard deviation of the net lengths.

With the help of this tool, I redid the connections between the RAM and the processor with even trace lengths. For the original Rascals, I laid the board out without particular attention to signal integrity. The average memory trace was 2.91 inches with a standard deviation of a whopping 1.24 inches. For Rascal 1.2, the average trace was reduced to 2.15 inches and the standard deviation was 0.23 inches. The picture below shows the original Rascal in blue on the left and the new version in red on the right. You can see that the blue version looks crazy; in the red version, the longest mismatch relative to the clock line is roughly 5x better at around 0.7 inches, which corresponds to a delay mismatch of 0.1 nanoseconds.
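For anyone who wants to reproduce the average and standard deviation from a CSV of net lengths, here is a minimal Python sketch. The sample rows are hypothetical, not the Rascal's real data, and I'm assuming a population standard deviation (`statistics.pstdev`); swap in `statistics.stdev` if you prefer the sample version.

```python
import csv
import io
import statistics


def length_stats(csv_text):
    """Return (mean, standard deviation) of the lengths in 'name,length' rows."""
    rows = csv.reader(io.StringIO(csv_text))
    lengths = [float(row[1]) for row in rows if row]
    return statistics.mean(lengths), statistics.pstdev(lengths)


# Hypothetical net lengths in inches.
sample = "A0,2.0\nA1,2.2\nA2,2.4\n"
mean, spread = length_stats(sample)
print(f"mean {mean:.2f} in, standard deviation {spread:.2f} in")
```

If the lengths are in mils, as written by `CoordToMils` in the Altium script, divide by 1000 to get inches.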

I also rearranged the decoupling capacitors to minimize the length of their connections to the ground and power planes. I tried to space traces at least 6 mil apart from each other to minimize crosstalk between them. I could have done a more serious analysis of the trace spacing to minimize crosstalk, but as with termination resistors, I opted not to do it until I know I have to. My original, naive Rascal design worked, even though my traces were drastically different lengths, crammed together, and varying in impedance, so my hope is that a layout that doesn't do anything really stupid will succeed.

Rascal 1.2 on the way

The Rascal 1.2 PCBs have been sent out for assembly in Colorado; I should have the new Rascals back later this week. If they work, I'll have a larger batch up for sale in a few weeks as I already have the PCBs. Otherwise, Rascal 1.2 be cursed, and on to Rascal 1.3!

UPDATE

I got the new Rascals today (May 2, 2012) and they work. I haven't tested them 100% yet, but they both booted correctly, passed a few minutes of memory testing (~1 trillion write/read cycles without error) and loaded a bunch of files in the editor without a hitch.

Code

Here's the Jscript code for talking to Altium; feel free to do whatever you want with it. I'd recommend deleting it and then destroying all storage media it may have tainted.

```javascript
var Board; // IPCB_Board
var Net;
var Iterator;
var ReportFile;

var address_lines = ["A0","A1","A2","A3","A4","A5","A6","A7","A8","A9","A10","A11","A13","A14"]; // A12, A15+ omitted deliberately
var data_lines = ["D0","D1","D2","D3","D4","D5","D6","D7","D8","D9","D10","D11","D12","D13","D14","D15",
                  "D16","D17","D18","D19","D20","D21","D22","D23","D24","D25","D26","D27","D28","D29","D30","D31"];
var control_lines = ["CKE","CLK","CS","CAS","DQMH1","DQMH2","RAS","WE"];
var term_res_lines = ["NetRA3_1","NetRA3_2","NetRA3_3","NetRA3_4","NetRA4_1","NetRA4_2","NetRA4_3","NetRA4_4",
                      "NetRA5_1","NetRA5_2","NetRA5_3","NetRA5_4","NetRA6_1","NetRA6_2","NetRA6_3","NetRA6_4"]; // NetRA1 and NetRA2 omitted deliberately
// The stray leading comma in the original array literal has been removed.
var more_res_lines = ["NetRA7_1","NetRA7_2","NetRA7_3","NetRA7_4","NetRA8_1","NetRA8_2","NetRA8_3","NetRA8_4",
                      "NetRA9_1","NetRA9_2","NetRA9_3","NetRA9_4","NetRA10_1","NetRA10_2","NetRA10_3","NetRA10_4",
                      "NetRA11_1","NetRA11_2","NetRA11_3","NetRA11_4","NetRA12_1","NetRA12_2","NetRA12_3","NetRA12_4",
                      "NetRA13_1","NetRA13_2","NetRA13_3","NetRA13_4","NetRA14_2","NetRA14_3"];
var critical_nets = address_lines.concat(data_lines, control_lines, term_res_lines, more_res_lines);

// Supply a minimal indexOf for script engines that lack one.
if (!Array.prototype.indexOf) {
    Array.prototype.indexOf = function(item) {
        var i = this.length;
        while (i--) {
            if (this[i] === item) return i;
        }
        return -1; // not found
    };
}

// The backslash must be escaped inside a string literal.
var FileName = "C:\\Net_Length_Report.Txt";

var fso = new ActiveXObject("Scripting.FileSystemObject");
ReportFile = fso.CreateTextFile(FileName, true);

function ShowBusLength() {
    Board = PCBServer.GetCurrentPCBBoard;
    Iterator = Board.BoardIterator_Create;

    // Visit every net object on every layer.
    Iterator.AddFilter_ObjectSet(MkSet(eNetObject));
    //Iterator.AddFilter_NetClass("U5-signals");
    Iterator.AddFilter_LayerSet(AllLayers);
    Iterator.AddFilter_Method(eProcessAll);
    Net = Iterator.FirstPCBObject;

    // Write "name,length in mils" for each net on the critical list.
    while (Net != null) {
        if (critical_nets.indexOf(Net.Name) >= 0) {
            ReportFile.WriteLine(Net.Name + "," + CoordToMils(Net.RoutedLength));
        }
        Net = Iterator.NextPCBObject;
    }
    ReportFile.Close();

    // Open the finished report in Altium.
    var ReportDocument = Client.OpenDocument("Text", FileName);
    if (ReportDocument != null)
        Client.ShowDocument(ReportDocument);
}
```

Here's the Python script for summing lengths of multi-leg nets.

```python
import csv

# Read "net name,length" rows into a dictionary keyed by net name.
d = {}
with open('net-length-report.csv', newline='') as infile:
    for line in csv.reader(infile):
        d[line[0]] = line[1]
d['zero'] = '0.0'  # zero-length placeholder entry

buses = [
    ('A0', 'NetRA11_1'),
    ('A1', 'NetRA11_2'),
    ('A2', 'NetRA11_3'),
    ('A3', 'NetRA11_4'),
    ('A4', 'NetRA12_1'),
    ('A5', 'NetRA12_2'),
    ('A6', 'NetRA12_3'),
    # (you get the idea-- there were more nets listed here)
]

# Sum the two legs of each net and write one "name,total" row per net.
with open('output.csv', 'w') as outfile:
    for leg1, leg2 in buses:
        total = float(d[leg1]) + float(d[leg2])
        outfile.write(leg1 + ',' + str(total) + '\n')
```

April 26, 2012

Open source hardware in Washington, D.C.

Last weekend, I traveled down to Washington, D.C. to visit family; while I was there, I went to an open hardware event put on by Public Knowledge on Friday afternoon. The event was organized by Michael Weinberg of Public Knowledge. Cat Johnson has a nice discussion with Michael explaining why they want to introduce policymakers to the idea of open hardware.

In reference to last year's similar event for 3D printing, Michael says,

"If you're a legislator and the first time you ever hear about 3-D printing is someone coming into your office saying, 'This horrible pirate box is ruining my business,' you would have one world view of 3-D printing. If the first time you hear about 3-D printing is someone saying, 'Wow, look at all these new businesses and people who are coming together around technology and creating all these amazing things,' you have a different world view about the subject."

Cat explains that with this approach, "Public Knowledge doesn’t have to spend the first half of a meeting explaining what it is they’re talking about."

The event consisted of two panels about open source hardware, followed by a demo session in the Rayburn Foyer. I watched the panels and had a table in the demo session. The demo I brought for the event is a motor that can have its speed controlled through the internet; the picture below shows its I-am-not-a-terrorist guise.

The two panels took place in a small room on the third floor of the Rayburn office building. Each panel had 4 or 5 people. There were around 100 people there; at least 40 of them were open hardware folks; the remainder were either policy people or random geeks who wanted to see the action.

My hometown of Somerville is in Rep. Mike Capuano's district (he used to be mayor of Somerville), but Capuano and his aides were too busy to come to the event.

Where we stand

The panelists answered questions from the moderators, Alicia Gibb and Michael Weinberg. The event was filmed, so I won't try to reproduce all the questions and answers here, but it was definitely interesting and informative. Bunnie Huang had a great rant about how everyone involved in the production of electronics should visit the electronics markets in Shenzhen and feel the energy of commerce. (If that intrigues you, read his great blog post about Shenzhen from a few years ago.) All hardware, with some level of reverse engineering, is open, he pointed out. There are huge markets that don't have the strong IP rights enforcement we have in the US.

To me, the point is that we are making open source hardware even if we don't want to. This forces us to adopt openness by default, so that we get the benefits as well as the detriments.

I was encouraged to see that we have a bunch of smart people on our side. Bunnie, AnnMarie Thomas, David Mellis, and Nathan Seidle are articulate and funny. The demos were impressive. My favorite moment of the day was talking to a janitor who was drawn in by a RepRap 3D printer that was running unattended before the event officially started. Noticing that she was staring at the machine and seemed puzzled, I pointed out the printed-out plastic rabbits on the table and explained that the RepRap was halfway through printing another one. I showed her the pair of new Makerbot Replicators that can print two colors at once. She had the 3D printing epiphany and said something like, "This thing can make you anything you want!" She grabbed a few Makerbot fliers and as she left, she said, "I gotta tell people about this!" She had the altered-world urgency of someone who has just learned that Kennedy is still alive, or discovered that soylent green is people, or seen Nathan's secret tattoo. (Don't bother asking. He just pretends it doesn't exist.) I'm glad that we could help spread the enthusiasm.

The demos went well; I'd estimate that 50-100 people passed through, which is pretty good for a sunny Friday afternoon. I hadn't seen the very cool squishy circuits before; somehow, they managed to attract the children in the crowd, which is impressive for an event aimed at adults in suits. The Rascal demo worked perfectly, though I did have a couple moments of terror while setting up. First, it seems that in double-checking my demo, I managed to delete my DHCP server's configuration file. Fortunately, it was easy to switch the Rascal to a static IP. I guess I'm prepared for idiots like me.

Then, in my motor control demo, some of the Javascript code was linked from external servers. When I tested it the night before without an internet connection, the Rascal used cached versions of the code. 12 hours later, the cached versions expired, which left me scrambling for a wireless connection to pull down the code I needed. After some quick modifications to use local code, the demo worked quite well. Several people stopped by to ask when new Rascals would be available for sale, which made my day. (Thanks, Shawn!)

Still, my overall impression was that as a group, we open hardware folks are weak on the policy front. We don't know what policies we want. Worse, we don't even know what outcomes we want. Some of us really don't care about licenses; others think we need them desperately; and still others think licensing hardware is hopeless. We're just setting up the first open source hardware nonprofit organization. Our most experienced professionals are people like Nathan and Bunnie, who were in college 10 years ago. Generally, we seem like people much more interested in building stuff than public policy. In some contexts, our political inexperience is alarming; Michael Weinberg put it more optimistically: "No, that just means you've made prudent life choices!"

Thanks to Public Knowledge for organizing the event.

(More details about the state of the new Rascal boards coming soon.)

March 06, 2012

The first Rascal 1.1 prototypes have arrived

After a lot of ordering parts, tweaking of PCB layouts, and optimizing the bill of materials, the new Rascals have arrived from the assembler in Colorado.

Whenever I build a new version of the Rascal, the big question is, "Have I made some horribly stupid error that has turned all of the parts I sent away to the assembler into landfill stuffing?" So far, I haven't found any landfill-worthy errors.

Here's the list of changes.

Added second USB host port and stacked USB connector
Put in a new Ethernet controller (Micrel KSZ8051RNL)
Moved the I²C/TWI pins to the Arduino-compatible location
Added I²S port for streaming audio
Changed JTAG footprint to work with pogo pins more easily
Switched to black Samtec headers rather than the blue ones, which were chintzy

What works and what doesn't?

The new Rascal has two USB host ports. (Earlier Rascals have only one, and with a worse connector.) I plugged in an old Logitech webcam I picked up at the MIT flea market a few months ago and it was correctly identified by the kernel, as you can see in the command line snippet below.

```bash
Bus 001 Device 001: ID 1d6b:0001 Linux Foundation 1.1 root hub
Bus 001 Device 002: ID 046d:08b0 Logitech, Inc. QuickCam 3000 Pro [pwc]
```

Both USB ports are at least wired correctly, but I haven't tested them more thoroughly yet.

The hairiest change in this rev of the Rascal was the change to a new ethernet PHY chip, the Micrel KSZ8051. The new chip costs less, uses less power, requires fewer external caps and resistors to make it work, and, most importantly, won't be obsolete any time soon. The macb driver in the Linux kernel correctly identifies the controller, as you can see in the snippet of boot spew below, but the link doesn't come up. I'm not sure where the problem is yet.

```bash
MACB_mii_bus: probed
eth0: Atmel MACB at 0xfffc4000 irq 21 (02:71:82:06:00:14)
eth0: attached PHY driver [Micrel KS8051] (mii_bus:phy_addr=ffffffff:00, irq=-1)
```

I haven't tested the I²S or I²C ports yet.

The new JTAG interface worked for programming the kernel and bootloaders into the serial flash on the Rascal. I had to make a new programming fixture, pictured below. The beige part was printed on the Uprint 3D printer at Artisan's Asylum. The brass pins are pogopins, i.e. their tips are spring-loaded like (upside-down) pogosticks. The 20-pin black connector at lower right will connect to an Atmel JTAG pod, which then connects via USB to a PC.

Having access to high-quality 3D printing on the cheap (that part cost around $10) is definitely changing the way I design stuff. For low volume plastic parts like this, it's a huge improvement over sending stuff out to be machined out of Delrin with a 5 day turn for $500.

Here's one last photo, which shows the Rascal in the fixture. The pogopins are under the PCB under the clamp.

If I can get the ethernet port working under Linux and no other problems emerge, I'll send the 20 remaining PCBs out for assembly, and they'll be available in the store for sale when I get them back. After that will come the first batch of 100 Rascals, which will be a huge milestone.

So far, so good!

January 31, 2012

Rascal 1.1 circuit board released for fabrication

I just sent the Rascal 1.1 PCB off to International Circuits for fabrication. I'm calling it version 1.1 because the last version, which was officially 0.6, AKA "the beta," turned out to be 1.0-level in quality.

Here's a screenshot of the final layout. (Click for a substantially larger version.)

Thanks to Jinbuhm Kim of Wiznet for the last-minute suggestion to add pin labels on the underside of the PCB (so you can see them after an Arduino shield is plugged in on top).

A full list of the hardware changes appears in the previous blog post. The software will be updated as well-- a new Linux kernel, a new driver to call Python from hardware interrupts, an improved web editor, and some awesome new HTML5 demos from the volunteer sharpshooting development team in England.

If you think the Rascal might be right for your next ridiculous internet machine, sign up for the announcement list to get alerted when they're ready.
