Thursday, May 15, 2008

Breaking down a complex system software problem and solving it in steps

I recently got the USB 2.0 interface working on the receiver boards. Initially it was a black box to me: a h/w IP, plus documentation of the register interface and the USB signals to look at. This post is a general description of the system software problems I ran into, to give a glimpse of how a complex system software problem is broken down and solved. Proprietary h/w and s/w details are not explained.

Expectation: Enable the USB host controller, with the root hub powering 2 USB ports, and analyze the data bursts on the USB bus. The host controller passes data to the PHY, which serializes it into bits and adds the sync header, and finally the transceiver spits it out over the USB bus. We used state-of-the-art solid state drives for testing.

To envision the USB architecture: imagine the USB host class driver sitting over the USB host stack, which sits over the USB host controller driver layer. In the h/w there is normally a host controller, plus a device controller on the device side. The application sits over the host class driver.


Problem statement:
  • Ensure the EHCI host controller driver is up and running!!
    • Ensure all the USB registers are initialized correctly.
    • Use uncached, volatile accesses for all register reads/writes. Otherwise we get machine checks from the ARC.
    • Map the register values against the USB documentation.
    • Understand the offsets used by the USBx code, and use scripts to check the values of the USB h/w registers after ensuring that the EHCI host init is correct and complete.
    • Ensure that the threads are working well and relinquishing control at the right instant.
  • Ensure USB host stack is up and running!!
    • No hiccups faced
  • Ensure USB class driver is up and running!!
    • Described later in the article
  • Ensure interrupts are working, using some low-level BSP changes and a synchronization delay in the test application
    • Ensure that the cache is disabled, or use cache-safe macros
    • Ensure the following important interrupts:
      • Root hub status change signal interrupt (vital for device insertion/extraction)
        • This was ensured using the cache-disable macros mentioned above.
        • Ensure that USB reference clock frequency is correct.
        • Ensure that the timer interrupts are working at the right frequency
      • End of Transaction interrupt (absolutely vital for processing async TDs)
      • Host error interrupt (vital for catching transaction errors)
  • Using Scope
    • Using an oscilloscope to measure crystal frequency
    • See J/Ks and SoFs (Start of Frame; to understand more, Google it). This helped us confirm that SoFs were actually going out on the bus.
  • Using Beagle Toolbox
    • This came in really handy, as I could see the real transactions on the bus and gained confidence that something was actually going out on the bus.
      • Learnt to play with the toolbox
        • Understood how to connect and generate a trace
        • Generated test traces with different devices (e.g. flash drive, camera, iPod, low-speed mouse, USB 2.0 SATA adapter) in order to understand the traces in different scenarios. Found that using a high-speed USB drive always resulted in set_configuration at the end, with an ack of 0 coming back.
  • Solving the final problem after the set_configuration command
    • By now we understood that 2 important interrupts form the basis for USB transactions: the root hub status change signal and the end of transaction interrupt.
    • So I asked myself: why is the storage instance of the device not available to the application even though the configuration is being set correctly? I found that the logical entity for the device was being freed even though device enumeration was correct.
    • I watched the following functions very closely across my iterations, along with the traces on the Beagle toolbox.
      • Registration of a USB controller driver with the USB stack
      • Interrupt handler
      • Device insertion
      • Connection change on the port
      • Create new device
      • Command Block Wrapper (CBW) that encapsulates the SCSI request
      • SCSI unit is ready for data transfer
        • I focused especially on the above 2 routines because in my traces, along with the device enumeration routine, I could see the following: Class=0x08 SubClass=0x06 Prot=0x50, which basically means it is a SCSI storage device using the Bulk-Only protocol.
      • Storage activation
        • I found that this function was never called and the resources were being freed before this point. So I decided to look very closely into the storage entry, storage thread entry and storage driver entry functions.
        • The storage entry function basically implements a state machine for the storage entity. I understood that my problem lay in the class driver, i.e. the layer between the host controller driver and the application, because the host controller driver properly enumerated the device after setting the correct address and was able to read all the device descriptors correctly. So I started watching the above storage-related class driver functions at run time.
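
To make the Class=0x08 SubClass=0x06 Prot=0x50 triple and the CBW a bit more concrete, here is a minimal Python sketch. It is not code from our stack: the function names `describe_interface` and `build_cbw` are made up for illustration, and the lookup tables only cover the three codes seen in the trace. It assumes the standard 9-byte interface descriptor layout and the 31-byte Bulk-Only CBW layout.

```python
import struct

# Only the codes seen in the trace are listed; anything else prints as unknown.
CLASS_NAMES = {0x08: "Mass Storage"}
SUBCLASS_NAMES = {0x06: "SCSI transparent command set"}
PROTOCOL_NAMES = {0x50: "Bulk-Only Transport"}

CBW_SIGNATURE = 0x43425355  # reads "USBC" when stored little-endian
CBW_LENGTH = 31


def describe_interface(descriptor):
    """Decode the class/subclass/protocol triple of an interface descriptor."""
    (length, desc_type, _num, _alt, _eps,
     cls, subcls, proto, _istr) = struct.unpack("<9B", descriptor[:9])
    if length != 9 or desc_type != 0x04:
        raise ValueError("not a standard interface descriptor")
    names = [
        CLASS_NAMES.get(cls, "unknown class 0x%02X" % cls),
        SUBCLASS_NAMES.get(subcls, "unknown subclass 0x%02X" % subcls),
        PROTOCOL_NAMES.get(proto, "unknown protocol 0x%02X" % proto),
    ]
    return " / ".join(names)


def build_cbw(tag, transfer_length, data_in, lun, cdb):
    """Wrap a SCSI command block (1 to 16 bytes) in a 31-byte CBW."""
    if not 1 <= len(cdb) <= 16:
        raise ValueError("command block must be 1 to 16 bytes")
    flags = 0x80 if data_in else 0x00  # bit 7 set means device-to-host data
    # "16s" pads the command block with zero bytes up to 16.
    return struct.pack("<IIIBBB16s", CBW_SIGNATURE, tag, transfer_length,
                       flags, lun, len(cdb), cdb)


if __name__ == "__main__":
    # Hypothetical descriptor carrying the triple from the trace.
    print(describe_interface(bytes([9, 4, 0, 0, 2, 0x08, 0x06, 0x50, 0])))
    # TEST UNIT READY is a 6-byte all-zero command block with no data phase.
    cbw = build_cbw(tag=1, transfer_length=0, data_in=False, lun=0, cdb=bytes(6))
    print(len(cbw), cbw.hex())
```

Decoding the bytes by hand like this is handy next to the analyzer: it lets you match a bulk OUT packet in the trace to the command the class driver meant to send.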

Storage activation is normally done after memory has been allocated for the threads and for the media structures used by our filesystem.

Bingo!! I was excited here. I knew I was close.

          • Thread stack allocation required X bytes. It was okay and I checked the memory pool available after thread stack allocation.
          • Thread creation for storage entry was fine
          • The memory requirement for the file system media structures was a whopping 10X bytes + Y bytes for alignment. I simply didn’t have a free memory block that big, so the allocation returned an insufficient-memory error. I increased the memory pool in the application and started getting DATA0 and DATA1, as seen on the Beagle toolbox. I wrote my own program to create/read/write/close/delete files on the media and validated the result with a PC. It worked all right.
          • Why didn’t I find it before??
            • There was no way to see USB bus transactions without a USB analyzer.
            • Taking a step-by-step approach helped me understand the USB code in depth.
            • No error was returned for a storage device not being present. I had repeatedly verified that device enumeration was correct, so I thought the problem was in the lower BSP or the host controller driver rather than in the class driver.
            • There were 3 main threads in operation
              • One main thread
              • Storage entry thread: even when there was not much memory available, control just passed back and forth between the main thread, the driver entry thread and the storage entry thread, across 100s of functions.
              • Host controller driver thread
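
The failure above boils down to an allocation that did not fit and a caller that did not say so loudly. Here is a toy Python sketch of the idea, not our actual RTOS code: `BytePool` and `activate_storage` are hypothetical names, and the sizes in the example (4096 for X, 32 for the alignment) are invented just to show the 1X versus 10X imbalance.

```python
class BytePool:
    """Toy fixed-size memory pool; a stand-in for an RTOS byte pool."""

    def __init__(self, size):
        self.size = size
        self.used = 0

    @property
    def free(self):
        return self.size - self.used

    def allocate(self, nbytes, align=1):
        """Reserve nbytes at the next aligned offset; None if it does not fit."""
        start = -(-self.used // align) * align  # round used up to the alignment
        if start + nbytes > self.size:
            return None
        self.used = start + nbytes
        return start


def activate_storage(pool, stack_bytes, media_bytes, media_align):
    """Try both allocations in order and log exactly which one failed."""
    log = []
    requests = [
        ("thread stack", stack_bytes, 1),
        ("media structures", media_bytes, media_align),
    ]
    for name, nbytes, align in requests:
        if pool.allocate(nbytes, align) is None:
            log.append("%s: need %d bytes, only %d free"
                       % (name, nbytes, pool.free))
            return False, log
        log.append("%s: ok, %d bytes free afterwards" % (name, pool.free))
    return True, log


if __name__ == "__main__":
    X = 4096  # invented value standing in for the X of the post
    for pool_size in (16384, 65536):
        ok, log = activate_storage(BytePool(pool_size), X, 10 * X, 32)
        print(pool_size, "activated" if ok else "FAILED")
        for line in log:
            print("   ", line)
```

Had the real class driver logged a line like the failing one here, the hunt through the BSP and the host controller driver would have been much shorter.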
I firmly believe in breaking the problem statement down into subtasks and then solving the subtasks while checking off the segments of the code that have been tested. Getting complex s/w to work on a h/w IP is sometimes really time consuming and painful, as there are 100s of variables in the system, and normally it is some random variable (the transceiver not spitting out data properly, or the s/w stuck on some memory pool allocation issue) that hurts you the most. So it's always good to characterize your problem.


