Old programming habits die hard

February 8, 2011 ~ Nate Lawson ~ 14 Comments

While programming, it’s enlightening to be aware of the many influences you have. Decisions such as naming internal functions, coding style, organization, threading vs. asynchronous IO, etc. all happen because of your background. I think you could almost look at someone’s code and tell how old they are, even if they keep up with new languages and patterns.

When I think of my own programming background, I remember a famous quote:

“It is practically impossible to teach good programming to students that have had a prior exposure to BASIC: as potential programmers they are mentally mutilated beyond hope of regeneration.”
— Edsger W. Dijkstra, June 18, 1975

Large memory allocations are a problem

A common mistake is keeping a fixed memory allocation pattern in mind. Since machine capacities are still growing exponentially, even habits that you update linearly quickly fall behind.

Back in the 90’s, I would make an effort to keep stack frames within 4K or 8K total to avoid taking a page fault when the stack needed to grow. Deep recursion or copying from stack buffer to buffer was bad because it could trigger a fault into the kernel, which would grow the stack and slow down execution. It was better to reuse data in-place and pass around pointers.
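Here's a minimal C sketch of the difference between copying a large buffer through the stack and working on it in place. The `struct frame_buf` type, its 4096-byte size, and both function names are hypothetical, chosen only to match the 4K frame budget described above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 4 KB record, sized to match the old 4K stack-frame budget. */
struct frame_buf {
    uint8_t data[4096];
};

/* By value: on typical ABIs the whole 4096-byte struct is copied onto
 * the stack for every call, so one call can eat a full 4K page. */
uint32_t checksum_by_value(struct frame_buf buf)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < sizeof buf.data; i++)
        sum += buf.data[i];
    return sum;
}

/* By pointer: only the address is passed and the data is read in place. */
uint32_t checksum_in_place(const struct frame_buf *buf)
{
    assert(buf != NULL);
    uint32_t sum = 0;
    for (size_t i = 0; i < sizeof buf->data; i++)
        sum += buf->data[i];
    return sum;
}
```

Both functions return the same checksum. The difference is that each call to `checksum_by_value()` typically pushes a full 4 KB copy onto the stack, which is exactly the kind of frame growth that used to cost a trip into the kernel.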

Nowadays, you can malloc() gigabytes and servers have purely in-memory databases. While memory use is still important, the scale that we’re dealing with now is truly amazing (unless your brain treats performance as a log plot).

Never jump out of a for loop

The BASIC interpreter on early machines had limited garbage collection capability. If you used GOTO to exit a loop early, the loop's entry was left on the interpreter's stack unless you followed some guidelines. Eventually you’d run out of memory if you did this repeatedly.

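Because of this, it always feels a little awkward in C to use break to leave a for loop, even though it is just a GOTO (a jump) at the assembly level. Fortunately, C does a better job at stack management than BASIC: a loop doesn't push anything onto the stack, so there is nothing left behind to clean up.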
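For contrast, here is a minimal C sketch of an early exit with `break`. The function `find_first()` is a hypothetical example, not from the original post. Because `i` and `found` live in the function's own frame (or in registers), jumping out of the loop leaves nothing behind; the frame is released only when the function returns.

```c
#include <assert.h>
#include <stddef.h>

/* Return the index of the first element equal to key, or -1 if absent. */
long find_first(const int *a, size_t n, int key)
{
    assert(a != NULL || n == 0);
    long found = -1;
    for (size_t i = 0; i < n; i++) {
        if (a[i] == key) {
            found = (long)i;
            break;  /* a plain jump past the loop; nothing to unwind */
        }
    }
    return found;
}
```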

Low memory addresses are faster

On the 6502, instructions that access zero page addresses ($00-$FF) use a more compact instruction encoding than other addresses and also execute one cycle faster. In DOS, you may have spent a lot of time trying to fit things below the 1 MB barrier. On an Amiga, it was chip RAM versus fast RAM.
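To illustrate why zero-page operands were prized, here's a hypothetical C helper that picks the encoding the way a simple 6502 assembler would. The opcodes and timings are the standard ones for `LDA`: `$A5` (zero page, 2 bytes, 3 cycles) and `$AD` (absolute, 3 bytes, 4 cycles). The function name and interface are assumptions made for this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode a 6502 "LDA addr" into out[] and return its length in bytes.
 * Zero-page addresses ($00-$FF) get the short form; everything else
 * needs a full 16-bit absolute operand. */
size_t encode_lda(uint16_t addr, uint8_t out[3], unsigned *cycles)
{
    assert(out != NULL);
    if (addr <= 0xFF) {
        out[0] = 0xA5;                /* LDA zero page */
        out[1] = (uint8_t)addr;
        if (cycles) *cycles = 3;
        return 2;
    }
    out[0] = 0xAD;                    /* LDA absolute */
    out[1] = (uint8_t)(addr & 0xFF);  /* low byte first (little-endian) */
    out[2] = (uint8_t)(addr >> 8);
    if (cycles) *cycles = 4;
    return 3;
}
```

For example, `encode_lda(0x10, buf, &cyc)` emits `A5 10` and reports 3 cycles, while `encode_lda(0x1234, buf, &cyc)` emits `AD 34 12` and reports 4.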

Thus, it always feels a bit faster to me to use the first few elements of an array or an address with a lot of leading zeros. The former rule of thumb has morphed into cache-line access patterns, so it is still valid in a slightly different form. With virtualized addressing, the latter no longer applies.

Pointer storage is insignificant

In the distant past, programmers tried to fold multiple pointers into a single storage unit (the famous XOR linked-list trick). As memory became a little less scarce, this practice was denounced due to its impact on debugging and garbage collection. Meanwhile, on the PC, segmented memory made the 16-bit pointer size insignificant. As developers moved to 32-bit protected-mode machines in the 90’s, RAM size was still not an issue because it had grown accordingly.
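The XOR trick usually refers to an XOR linked list: each node stores the XOR of its previous and next addresses in one field, and a traversal recovers the next node from the node it just came from. Here's a minimal C sketch under that reading; the type and function names are hypothetical, and it assumes a null pointer converts to integer 0, as on mainstream platforms. It also shows why the practice fell out of favor: no field ever holds a real pointer, so a debugger or a garbage collector scanning memory cannot see the links.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One node stores prev XOR next in a single field instead of two pointers. */
struct xnode {
    int value;
    uintptr_t link;  /* (uintptr_t)prev ^ (uintptr_t)next */
};

/* Assumes (uintptr_t)NULL == 0, true on mainstream platforms. */
static uintptr_t xaddr(const struct xnode *p) { return (uintptr_t)p; }

/* Link nodes[0..n-1] into a list; both ends XOR against NULL. */
void xlist_link(struct xnode *nodes, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        const struct xnode *prev = (i > 0) ? &nodes[i - 1] : NULL;
        const struct xnode *next = (i + 1 < n) ? &nodes[i + 1] : NULL;
        nodes[i].link = xaddr(prev) ^ xaddr(next);
    }
}

/* Walk from either end, copying values into out; returns the count.
 * Knowing where we came from recovers the next node: next = link ^ prev. */
size_t xlist_walk(const struct xnode *head, int *out, size_t max)
{
    assert(out != NULL || max == 0);
    const struct xnode *prev = NULL, *cur = head;
    size_t count = 0;
    while (cur != NULL && count < max) {
        out[count++] = cur->value;
        const struct xnode *next =
            (const struct xnode *)(cur->link ^ xaddr(prev));
        prev = cur;
        cur = next;
    }
    return count;
}
```

The same `xlist_walk()` call works from either end of the list, since both terminating links are XORed against `NULL`.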

However, we’re at a peculiar juncture with RAM now. Increasing pointers from 32 to 64 bits uses 66% more RAM for a doubly-linked list implementation with each node storing a 32-bit integer: each node grows from 4 + 4 + 4 = 12 bytes to 8 + 8 + 4 = 20 bytes, before any alignment padding. If your list took 2 GB of RAM, now it takes 3.3 GB for no good reason. With virtual addressing, it often makes sense to return to a flat model where every process in the system has non-overlapping address space. A data structure such as a sparse hash table might be better than a linked list.
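The 66% figure follows from per-node arithmetic. This small C sketch spells it out, with the caveat that it counts only field bytes; on common 64-bit ABIs a real struct with two pointers and a 32-bit integer is typically padded to 24 bytes, which pushes the overhead closer to 100%. The function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Field bytes per doubly-linked-list node: two pointers plus a 32-bit
 * integer, ignoring any alignment padding the compiler adds. */
size_t node_bytes(size_t pointer_bytes)
{
    assert(pointer_bytes == 4 || pointer_bytes == 8);
    return 2 * pointer_bytes + 4;
}

/* Scale a list's memory use from 32-bit to 64-bit pointers. */
double scale_to_64bit(double size_at_32bit)
{
    return size_at_32bit * (double)node_bytes(8) / (double)node_bytes(4);
}
```

With these numbers, `node_bytes(4)` is 12, `node_bytes(8)` is 20, and `scale_to_64bit(2.0)` gives about 3.33, matching the 2 GB to 3.3 GB example.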

Where each process's working set is less than 4 GB, it may make sense to stay with a 32-bit OS and use PAE to access physical RAM beyond that limit. You get to keep 32-bit pointers, but each process can only address 4 GB of RAM. However, you can just run multiple processes to take advantage of the extra RAM. Today’s web architectures and horizontal scaling mean this may be a better choice than 64-bit for some applications.

The world of computing changes rapidly. What kind of programming practices have you evolved over the years? How are they still relevant or not? In what ways can today’s new generation of programmers learn from the past?

Building the ZoomFloppy slides

September 21, 2010 / September 28, 2010 ~ Nate Lawson ~ 6 Comments

At ECCC 2010, I presented these slides on the ZoomFloppy, a new device for accessing Commodore floppy drives from a PC via USB. The firmware, known as xum1541, has been available since fall 2009 for those who want to build their own board, but the ZoomFloppy is the first device that will be a complete product offered for sale. Jim Brain will be manufacturing and selling it by the end of the year.

The ZoomFloppy has a number of features beyond simple disk access, which is implemented in OpenCBM. It can also nibble protected disks using a parallel cable and nibtools. It is software-upgradeable and this presentation discusses some features that are planned for the future.

One surprising finding I made was that by running the 1571 drive in double-clocked (2 MHz) mode, the hardware UART is just fast enough to transfer raw bits directly off the media. No one has ever created a copier that took advantage of this “hidden” mode in the 25 years since the 1571 was introduced. Normally, this kind of transfer requires soldering a parallel cable into your drive. This mode works via the normal serial cable, but requires low-latency control of the bus that is only possible with a microcontroller (not a DB25 printer port).

I also discuss how modern day piracy on the PS3 affected our chip supply and digress a bit to discuss old copy protection schemes. I hope you enjoy the presentation.


xum1541 now supports nibbler

December 23, 2009 / December 28, 2009 ~ Nate Lawson ~ 4 Comments

One thing I like about the holidays is the chance to finish off hobby projects. Earlier this month, I released the first beta of the xum1541, which is a C64 USB floppy adapter. The first release supported basic functions to read and write disks via the OpenCBM utilities.

Today, I finished testing for parallel nibbler support. The code is available in the OpenCBM CVS repository, and directions are on my xum1541 page. When used with nibtools, it can now copy protected disks and transfer data much faster than before. I’ve successfully tested both read and write support on Windows and Mac OS X. This is quite a milestone, as it is the first USB interface to support the parallel nibbler protocol.

A bit of explanation is in order. The built-in interface for a 1541 floppy drive is serial and has CLK, DATA, and ATN signals. It is a serial version of the parallel IEEE-488 bus with conditions such as EOI signalled in-band instead of requiring separate wires. Commodore originally used IEEE-488 on the PET, but moved to the IEC serial protocol to cost-reduce the cables and avoid shortages in Belkin’s supply. The serial protocol is slower than parallel, but the legendary slowness of Commodore drives had more to do with attempting to maintain backwards compatibility with older drives, not the serial protocol itself. Third-party speeder cartridges fixed this in software by repurposing the serial signals for higher-speed signalling.

To get the full bit rate the drive mechanism is capable of, though, hardware modifications were required. Copiers such as Burst Nibbler added an 8-bit parallel cable in addition to the serial lines. This was relatively easy since there are two 6522 IO chips in the 1541 drive. Each has two 8-bit IO ports, and one of them is not normally used. So the parallel cable can be connected to the unused lines. Since the drive ROM does not use these lines, the copier has to load a custom routine into the drive’s RAM while initializing. It is then activated to manage the data transfer.

When Commodore hardware died out, users still needed to transfer data to and from floppies. The X-series of cables was invented, using the PC printer port for interfacing. That worked for a while until Windows NT and above made it harder to get accurate inb/outb timing, and then the DB25 printer port disappeared completely. USB established itself as the next great thing.

USB is high bandwidth but also high latency. The bit-banging approach to interfacing via the printer port would no longer work. It takes around 1 ms to get data to a USB device, no matter how small the transfer. Since the 1541 drive mechanism transfers data at 40 KB/sec, that is about 25 microseconds per byte, much less than the latency. The xum1541 does all the handshaking with the drive in an AT90USB microcontroller running at 8 MHz, giving great accuracy. The data transfers to the host are done via a double-buffered hardware USB engine. It has a state machine that handles the actual USB signalling, so we can flip buffers while it is clocking data out to the host. This gives us the cycles we need for the drive.
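The latency mismatch is easy to check numerically. This C sketch assumes 40 KB/sec means 40,000 bytes per second and uses the 1 ms round trip quoted above; the function names are hypothetical.

```c
#include <assert.h>

/* Time per byte, in microseconds, at a given byte rate. */
double us_per_byte(double bytes_per_sec)
{
    assert(bytes_per_sec > 0);
    return 1e6 / bytes_per_sec;
}

/* Bytes the drive produces while one USB round trip is outstanding. */
double bytes_during_latency(double bytes_per_sec, double latency_us)
{
    return latency_us / us_per_byte(bytes_per_sec);
}
```

At 40,000 bytes/sec each byte takes 25 microseconds, so about 40 bytes arrive during a single 1 ms USB round trip. Waiting on the host for each byte cannot keep up, which is why the handshaking has to live in the microcontroller and the data has to move in buffered blocks.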

The protocol is actually pretty simple. The setup routines, such as which track to select, signal that a byte is ready for the drive by toggling ATN, while the drive toggles DATA to acknowledge it has seen it. The custom drive code reads these bytes and then jumps to the appropriate handler. When it is done, it sends back a status byte via the same protocol.
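The setup handshake can be modeled in a few lines of C. This is a hypothetical simulation, not the xum1541 firmware: the `struct bus` fields, the function names, and the assumption that the command byte travels over the parallel cable are all illustrative. The host toggles ATN to say a byte is ready, and the simulated drive toggles DATA to acknowledge it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of the lines involved, plus simulated drive state. */
struct bus {
    uint8_t par;          /* 8-bit parallel cable (assumed data path) */
    bool atn;             /* toggled by the host */
    bool data;            /* toggled by the drive */
    bool drive_atn_seen;  /* simulated drive: last ATN level it saw */
    uint8_t rx[16];       /* simulated drive: bytes received */
    size_t rx_len;
};

/* Simulated drive code: on each ATN transition, latch the parallel
 * byte and toggle DATA to acknowledge. */
static void drive_step(struct bus *b)
{
    if (b->atn != b->drive_atn_seen) {
        b->drive_atn_seen = b->atn;
        if (b->rx_len < sizeof b->rx)
            b->rx[b->rx_len++] = b->par;
        b->data = !b->data;
    }
}

/* Host side: present a byte, toggle ATN, then wait for DATA to toggle.
 * Returns false if no acknowledgement arrives within max_polls. */
bool host_send_byte(struct bus *b, uint8_t byte, unsigned max_polls)
{
    bool data_before = b->data;
    b->par = byte;
    b->atn = !b->atn;                 /* "a byte is ready" */
    for (unsigned i = 0; i < max_polls; i++) {
        drive_step(b);                /* stand-in for the drive running */
        if (b->data != data_before)
            return true;              /* drive acknowledged */
    }
    return false;
}
```

On real hardware, `drive_step()` corresponds to the custom code running on the drive's 6502, and the status byte it returns would travel back over the same kind of handshake.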

For the high-speed transfer, something even lighter weight is needed. The drive CPU is a 6502 running at 1 MHz, which leaves 25 clock cycles per byte at 40 KB/sec, or about 12 instructions, since the fastest 6502 instructions take 2 cycles. The transfer protocol is started with a handshaked read or write as above. Then the drive begins to transfer data one byte at a time, toggling the DATA line each time a byte is ready. The microcontroller stays in sync by waiting for each transition and then reading a byte from the parallel cable. Thus, the path from the initial handshake to the data transfer loop must be very quick and then continue without interruption.
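For the bulk transfer, the receiver doesn't need a full handshake per byte; it just watches for DATA to change level. Here's a C sketch of that loop over a hypothetical array of sampled line states. The `struct sample` type and `receive_toggled()` are assumptions for illustration; real firmware would poll the port pins in a tight loop. Treating each edge as a new byte, rather than a pulse, needs one transition per byte instead of two, which helps when the drive has only about 25 cycles per byte.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical snapshot of the lines as the microcontroller samples them. */
struct sample {
    bool data;    /* DATA line level */
    uint8_t par;  /* parallel cable contents */
};

/* Receive up to max bytes. Each DATA transition (either edge) means a
 * new byte is on the parallel cable; samples without a transition are
 * skipped, which is how the receiver waits for the drive. */
size_t receive_toggled(const struct sample *s, size_t nsamples,
                       bool initial_data, uint8_t *out, size_t max)
{
    assert(s != NULL || nsamples == 0);
    bool last = initial_data;
    size_t n = 0;
    for (size_t i = 0; i < nsamples && n < max; i++) {
        if (s[i].data != last) {  /* transition: byte ready */
            last = s[i].data;
            out[n++] = s[i].par;
        }
    }
    return n;
}
```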

The parallel transfer gets you something else besides high speed. Many protection schemes were built on the fact that the 1541 only has 2 KB of RAM, not enough to store a full track, which is up to 8 KB. If a mastering machine wrote a track pattern with many repeated sections, ordinary copying software that read the track in pieces could not be sure it had lined the pieces up properly when reassembling the pattern on the backup copy. The protection scheme, which could read and analyze the entire track in one pass, would detect this difference and refuse to run the game. To duplicate this kind of disk, users either added 8 KB of RAM to the drive or added a parallel cable. Both allow an entire track to be read in a single pass.

It was fun implementing this protocol because microcontrollers are a dedicated platform. You can count clock cycles for your instructions and be guaranteed latency. Compared to desktop PCs, where you’re running concurrently with questionable software written by people who definitely don’t count cycles, this is a dream. If you make a mistake, it is your fault. There is nothing like an SMI handler that could lock the CPU for seconds while it handles a volume button press.

Happy Holidays from all of us at Root Labs!


xum1541 beta now available

December 11, 2009 ~ Nate Lawson ~ 4 Comments

I’m proud to announce that the beta release of the xum1541 USB floppy adapter is now available. The firmware and host-side code are in OpenCBM CVS. See my xum1541 home page for information about building and setting up the first hardware, based on the Atmel AT90USBKEY development board.

This beta is pretty well-tested on Windows and Mac OS X, including error handling cases. However, both the device and host-side code are likely to change between now and the final release, so be sure you’re willing to upgrade if you want to start using it now. Notably, the nibbler support is still being debugged, so it isn’t enabled yet.

I’d like to thank Wolfgang Moser, Spiro Trikaliotis, and Christian Vogelgsang for testing various code drops, building their own devices, and providing good advice as things progressed.

Getting kids started in science and electronics

November 12, 2009 / October 28, 2009 ~ Nate Lawson ~ 3 Comments

I’m happy that there’s been a recent resurgence of the build-it-yourself mentality in the tech crowd. For a while, there was a dearth of interest in how electronics and low-level software work. If you were lucky enough to have engineers as parents, you may have grown up with electronic kits and mathematics. But if you grew up in a small town like I did, you may have learned from an amateur radio operator.

Radio was the most popular tech hobby before personal computers became common in the early 1980’s. In the 1950’s, my parents’ generation built crystal radios or worked with awesome explosive and radioactive chemistry kits. In the 1960’s, kids built transistor radios or simple relay-based logic to play tic-tac-toe. I have great memories of visiting my relatives and finding these projects from their childhood in the closet. With the advent of multi-frequency scanners and cheaper radios in the 1970’s, amateur radio became even more popular than it ever had been. CB radio was even mainstream.

There were several HAMs who were friends of our family. Hal was a dispatcher for an air-conditioning repair service. He had a spare bedroom that was full of equipment. It was pretty magical to hear the Morse code beeping out a message from a repeater on some distant mountain. In the evening, the teletype would constantly bang out words from HAMs all over the region, sending callsigns and weather reports to each other. It was the predecessor to IRC, IM, and texting.

Amateur radio and computers fit very well together. Hal gave my dad our first computer, a VIC-20, after he upgraded to a C64. He had used it to generate and decode Morse code (CW) as well as log the various contacts he made. It was obvious to me that one of the best uses for a computer was to interface with other things.

Jim was an electrician, installing wiring in new buildings. He was also a HAM, although he built more of his own equipment than Hal. He would review various circuits I had drawn and recommend improvements. One circuit I designed was a clone of the game Lazer Tag. I was excited about my efficient circuit for the IR transmitter and receiver. However, he told me that while the circuit was mostly correct, I would need a matched lens pair since the IR LED would disperse too much to be reliably read. Also, my design would falsely trigger due to background noise like the sun because it wasn’t a coded channel, just a simple detector. Still, it was really fun to come up with new designs with his help.

One time he took me up to a nearby mountain where all the radio stations had their towers. The local HAM club had a repeater there in a rack with other equipment. He pointed out each repeater, including the ones for the commercial FM stations. I thought about what would happen if I pulled the plug on the country station.

They also had an automatic phone patch. This allowed you to make local calls from any radio by sending DTMF to the repeater. That was pretty amazing at a time when there were no cellphones. While phone patches still exist today, they’ve become much rarer. Still, they can be useful in disasters when the cell networks are down or overloaded and the closest working phone is far away.

I’m not sure what tech hobby today is as widespread as amateur radio was back then. Even people with blue-collar backgrounds were interested in it. While building your Arduino kits, be sure to invite the neighborhood kids. They might be the next Steve Wozniak or Bunnie Huang. I’m thankful for the HAMs that helped me when I was first getting started.

Update on xum1541 development

October 14, 2009 / October 5, 2009 ~ Nate Lawson ~ 4 Comments

Earlier this year, I announced the xum1541 project. This is a microcontroller board that connects a C64 floppy drive via USB to a PC. It is intended to run at a high enough speed to support copying protected disks. Here’s an update on the project’s recent progress.

When I first started this work, I examined the xu1541 adapter developed by Till Harbaum in 2007. It used an AVR microcontroller and a software USB stack. It was a neat project but had a few limitations. The goal for the xu1541 was to be as cheap as possible and use only through-hole parts. Since it used software USB, the microcontroller spent a lot of time bit-banging the USB port and so could not transfer data as fast as the 1541 could (especially with a parallel cable). Also, it required JTAG support to program the microcontroller the first time, something not all users would have. Still, it was a very neat project and is now available for purchase.

I started over with the AT90USB microcontroller. This device has a hardware USB engine that can run at full speed while the main CPU core is running your firmware. It also comes with a bootloader pre-programmed at the factory so users can install the firmware simply by plugging it into a USB port. There is a very nice open-source interface layer for this USB hardware by Dean Camera called LUFA. There are also many pre-built development kits so adding the IEC connectors is all the soldering that is needed.

The first version of the xum1541 was backwards-compatible with the xu1541. You could use it with the stock OpenCBM software from CVS. However, it had some limitations that made this approach a dead end. The xu1541 works entirely via USB control transfers, which are not intended for high throughput. The AT90USB does not support double-buffering on the control endpoint. Even with a hardware USB engine, the control transfers hit a limit. There was no way I could get the latency down to the 25 microseconds per byte needed to match the nibbler protocol. However, I did see a good speed increase over the xu1541 simply due to the hardware USB engine.

Thus, I decided to change to using two bulk endpoints, similar to the mass storage IO model that is implemented in USB flash drives. The AT90USB supports double-buffering for two endpoints of 64 bytes each in this configuration. This means that the hardware USB engine will clock data out the bus while the CPU is filling the other buffer. Then the pages are flipped and the process continues. With this approach, I could decrease the latency for nibbler support and get a performance boost for regular transfers as well.
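Double buffering in this sense is a ping-pong scheme. The C sketch below is a simplified, synchronous model: the 64-byte bank size comes from the text, but the `struct pingpong` type and function names are hypothetical and do not reflect the AT90USB registers or the LUFA API. On real hardware the USB engine drains one bank in parallel with the CPU, and the firmware must check that a bank is free before refilling it; here the "send" is just a `memcpy` into a simulated host buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define EP_SIZE 64  /* bank size from the text: 64-byte bulk endpoint */

/* Hypothetical model of a double-buffered IN endpoint. */
struct pingpong {
    uint8_t bank[2][EP_SIZE];
    int cpu_bank;      /* bank the CPU is currently filling (0 or 1) */
    size_t fill;       /* bytes already written into that bank */
    uint8_t *host;     /* simulated host-side receive buffer */
    size_t host_cap;
    size_t host_len;
};

/* Stand-in for the USB engine clocking one bank out to the host. */
static void engine_send(struct pingpong *p, int b, size_t len)
{
    assert(p->host_len + len <= p->host_cap);
    memcpy(p->host + p->host_len, p->bank[b], len);
    p->host_len += len;
}

/* CPU side: store one byte. When the bank fills, hand it to the engine
 * and flip to the other bank so filling continues immediately. */
void pp_put(struct pingpong *p, uint8_t byte)
{
    p->bank[p->cpu_bank][p->fill++] = byte;
    if (p->fill == EP_SIZE) {
        engine_send(p, p->cpu_bank, EP_SIZE);
        p->cpu_bank ^= 1;
        p->fill = 0;
    }
}

/* Push out a partially filled bank at the end of a transfer. */
void pp_flush(struct pingpong *p)
{
    if (p->fill > 0) {
        engine_send(p, p->cpu_bank, p->fill);
        p->cpu_bank ^= 1;
        p->fill = 0;
    }
}
```

Writing 150 bytes through this model sends two full 64-byte banks and then a 22-byte short packet on flush, with the byte order preserved.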

This took a while to implement since it involved rewriting both the firmware and the host plugin component of OpenCBM to work together. Even though I made no effort to optimize the code, the results are already impressive.

| Command | Before | Now |
|---|---|---|
| `d64copy -t p 8 output.d64` | 48.016 sec | 35.429 sec |
| `cbmctrl download 8 0xc000 0x2000 rom1.bin` | 25.813 sec | 18.988 sec |

There are several beta testers who have built their own copy of the hardware and are testing this version. Once we have ironed out any remaining bugs, I will release the pinouts and first version of the code. One notable feature that will be missing for a little while is the nibbler support. It will require more tuning to work reliably. However, it can be supported simply with a software upgrade so there’s no reason to delay the xum1541 release once the basic feature set is stable, which should be soon. It’s already useful as a fast option for transferring unprotected floppy images. I have no plans to produce a commercial version of this product, but I expect someone will take the design and build a cost-reduced model with a nice enclosure.

Thanks for all the words of support and patience as this hobby project nears completion.