Re: NAND technical review
- From: Ross Younger <wry at ecoscentric dot com>
- To: Jonathan Larmour <jifl at jifvik dot org>
- Cc: eCos developers <ecos-devel at ecos dot sourceware dot org>
- Date: Tue, 06 Oct 2009 14:51:20 +0100
- Subject: Re: NAND technical review
- References: <4AC6218C.email@example.com>
Jonathan Larmour wrote:
> I think at first the ball is really in Ross/eCosCentric's court to give
> the technical rationale for the decision, so I'd like to ask him first
> to give his rationale and his own perspective of the comparison of the
Here goes with a comparison between the two in something close to their
current states (my 26/08 push to bugzilla 1000770, and Rutger's r659).
For brevity, I will refer to the two layers as "E" (eCosCentric) and "R"
(Rutger) from time to time.
Note that this is only really a comparison of the two NAND layers. I have
not attempted to compare the two YAFFS porting layers, though I do mention
them in a couple of places where it seemed relevant.
BTW: I will be off-net tomorrow and all next week, so please don't think I
am ignoring the discussion...
1. NAND 101 -------------------------------------------------------------
(Those familiar with NAND chips can skip this section, but I appreciate
that not everybody on-list is in the business of writing NAND device
drivers :-) )
A chip comprises a number of blocks (a round power of two).
Each block comprises a number of pages (another power of two).
Each page has a "main" data area (512 or 2048 bytes on current devices) and
a "spare" - aka out-of-band or OOB - area (16 or 64 bytes respectively).
It's up to the driver and application to decide how they will use the spare
area, but it's usual for some of it to be given over to storing ECC data,
and there is space for a factory-bad marker (see below).
Programming the chip must be performed a page at a time (sometimes a 512
Erasing must be performed a whole block at a time.
By way of illustration, in the chip spec sheet I have to hand (Samsung
* 1 page = 2k byte + 64 spare
* 1 block = 64 pages
* The whole chip has 1024 blocks, making for 128MB (1Gbit) of data and 4MB
(32Mbit) of spare area.
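
The arithmetic behind those figures can be checked with a few lines of C. This is only a sketch: the struct and function names are invented for illustration and belong to neither NAND layer.

```c
#include <stdint.h>

/* Hypothetical geometry description; all names are illustrative only. */
struct nand_geometry {
    uint32_t page_bytes;       /* main data area per page */
    uint32_t spare_bytes;      /* spare (OOB) area per page */
    uint32_t pages_per_block;
    uint32_t blocks;
};

/* Total main-area capacity in bytes. */
static uint64_t nand_data_bytes(const struct nand_geometry *g)
{
    return (uint64_t)g->page_bytes * g->pages_per_block * g->blocks;
}

/* Total spare-area capacity in bytes. */
static uint64_t nand_spare_bytes(const struct nand_geometry *g)
{
    return (uint64_t)g->spare_bytes * g->pages_per_block * g->blocks;
}

/* Smallest erasable unit: one whole block of main data. */
static uint32_t nand_erase_unit_bytes(const struct nand_geometry *g)
{
    return g->page_bytes * g->pages_per_block;
}

/* The example chip: 2k+64 byte pages, 64 pages per block, 1024 blocks. */
static const struct nand_geometry example_chip = { 2048, 64, 64, 1024 };
```

For the example chip this gives 128MB of data, 4MB of spare, and an erase unit of 128KB.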
Now, I mentioned ECC data. NAND technology has a number of underlying
limitations, importantly that it has reliability issues. I don't have a full
picture - the manufacturers seem to be understandably coy - but my
understanding is that on each page, a driver ought to be able to cope with a
single bit having flipped either on programming or on reading. The
recommended way to achieve this is by storing an ECC in the spare area: the
algorithm published by Samsung is popular, requiring 22 bits of ECC per 256
bytes of data and able to correct a 1 bit error and detect a 2 bit error.
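
To show how 22 bits can protect 256 bytes, here is a minimal sketch of a Hamming-style code of that size. It follows the general parity-pair idea only. The bit layout is an assumption of this sketch and is not claimed to match Samsung's published algorithm or either layer's on-flash format; all names are illustrative.

```c
#include <stdint.h>

#define ECC_DATA_BYTES 256u
#define ECC_POS_MASK   0x7FFu      /* 256 bytes = 2048 bits = 11-bit positions */
#define ECC_CODE_MASK  0x3FFFFFu   /* 22 bits in total */

/*
 * Low 11 bits: XOR of the bit positions of every set data bit.
 * High 11 bits: XOR of the inverted positions of every set data bit.
 */
static uint32_t ecc_calc(const uint8_t *data)
{
    uint32_t lo = 0, hi = 0;
    for (uint32_t pos = 0; pos < ECC_DATA_BYTES * 8u; pos++) {
        if (data[pos >> 3] & (1u << (pos & 7u))) {
            lo ^= pos;
            hi ^= ~pos & ECC_POS_MASK;
        }
    }
    return (hi << 11) | lo;
}

enum ecc_result { ECC_OK, ECC_CORRECTED, ECC_BAD_CODE, ECC_UNCORRECTABLE };

/* Check data against the stored code, repairing one flipped data bit. */
static enum ecc_result ecc_check(uint8_t *data, uint32_t stored)
{
    uint32_t diff = (stored ^ ecc_calc(data)) & ECC_CODE_MASK;
    uint32_t lo = diff & ECC_POS_MASK;
    uint32_t hi = (diff >> 11) & ECC_POS_MASK;

    if (diff == 0)
        return ECC_OK;
    if ((lo ^ hi) == ECC_POS_MASK) {
        /* Every parity pair disagrees: lo is the position of the bad bit. */
        data[lo >> 3] ^= (uint8_t)(1u << (lo & 7u));
        return ECC_CORRECTED;
    }
    if ((diff & (diff - 1u)) == 0)
        return ECC_BAD_CODE;       /* a single bit of the code itself flipped */
    return ECC_UNCORRECTABLE;      /* e.g. two data bits flipped */
}
```

A single flipped data bit changes the two halves by a position and its complement, which is what makes it locatable; two flipped bits change both halves identically, so they are detected but cannot be located.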
There is also the question of bad blocks. Again, full details are sketchy. A
chip may be shipped with a number of "factory-bad" blocks (e.g. up to 20 on
this Samsung chip); they are marked as such in their spare area. (What
constitutes a "bad" block is not published; one imagines that the factory
have access to more test information than users do and that there may be
statistical techniques involved in judging the likely reliability of the
block.) Blocks may also fail during the life of the device, usually by the
chip reporting a failure during a program or erase operation. Because of
this, the manufacturers recommend that chip drivers scan the device for
factory-bad markers then create and maintain a Bad Block Table throughout
the life of the device. How this is done is not prescribed, but the
behaviour of the Linux MTD layer is something approximating a de facto standard.
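
A minimal sketch of the scan-then-maintain idea, held purely in RAM. Everything here is hypothetical: how the factory-bad marker is detected is left to a driver-supplied hook, and neither persisting the table to flash nor the Linux MTD table format is shown.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define BBT_MAX_BLOCKS 1024u

struct bbt {
    uint32_t blocks;
    uint8_t  bad[BBT_MAX_BLOCKS / 8];   /* one bit per block, 1 = bad */
};

/* Driver-supplied test for the factory-bad marker (hypothetical hook). */
typedef bool (*factory_bad_fn)(uint32_t block, void *ctx);

/* Also used at runtime when a program or erase operation reports failure. */
static void bbt_mark_bad(struct bbt *t, uint32_t block)
{
    if (block < t->blocks)
        t->bad[block >> 3] |= (uint8_t)(1u << (block & 7u));
}

static bool bbt_is_bad(const struct bbt *t, uint32_t block)
{
    if (block >= t->blocks)
        return true;                    /* out of range: treat as unusable */
    return ((t->bad[block >> 3] >> (block & 7u)) & 1u) != 0;
}

/* Initial scan. Returns the number of factory-bad blocks, or -1 if the
 * device has more blocks than this table can hold. */
static int bbt_scan(struct bbt *t, uint32_t blocks,
                    factory_bad_fn is_factory_bad, void *ctx)
{
    int found = 0;
    if (blocks > BBT_MAX_BLOCKS)
        return -1;
    memset(t, 0, sizeof *t);
    t->blocks = blocks;
    for (uint32_t b = 0; b < blocks; b++) {
        if (is_factory_bad(b, ctx)) {
            bbt_mark_bad(t, b);
            found++;
        }
    }
    return found;
}

/* Stand-in hook: pretends blocks 7 and 300 were marked bad at the factory. */
static bool demo_factory_bad(uint32_t block, void *ctx)
{
    (void)ctx;
    return block == 7 || block == 300;
}
```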
(ii) Chip comms protocol
Getting data into and out of the chip involves a simple protocol sequence.
Commands are single bytes; addresses are sequences of a few bytes depending
on the chip size and the operation invoked.
For example, the sequence to read a page of data, per the spec sheet I have to hand, is:
* Write 0x00 into the command latch
* Write the four address bytes in turn into the address latch
* Write 0x30 into the command latch
* Chip signals Busy; wait for it to signal Ready
* Read out (up to) 2112 bytes of data.
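
The steps above can be sketched in C roughly as follows. The bus hooks and the recording stand-in are hypothetical, and the split of the four address bytes into two column bytes followed by two row (page) bytes is an assumption that should be checked against the data sheet of the chip in question.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical low-level hooks that a board port would provide. */
struct nand_bus {
    void    (*write_cmd)(void *ctx, uint8_t cmd);    /* via command latch */
    void    (*write_addr)(void *ctx, uint8_t addr);  /* via address latch */
    void    (*wait_ready)(void *ctx);                /* Busy -> Ready */
    uint8_t (*read_data)(void *ctx);
    void    *ctx;
};

/* Read the first len bytes (main area, then spare) of one page. */
static void nand_read_page(const struct nand_bus *bus, uint32_t page,
                           uint8_t *buf, size_t len)
{
    bus->write_cmd(bus->ctx, 0x00);
    bus->write_addr(bus->ctx, 0x00);                  /* column, low byte  */
    bus->write_addr(bus->ctx, 0x00);                  /* column, high byte */
    bus->write_addr(bus->ctx, (uint8_t)(page & 0xFF));
    bus->write_addr(bus->ctx, (uint8_t)((page >> 8) & 0xFF));
    bus->write_cmd(bus->ctx, 0x30);
    bus->wait_ready(bus->ctx);
    for (size_t i = 0; i < len; i++)
        buf[i] = bus->read_data(bus->ctx);
}

/* Off-target stand-in that records what was written to the latches. */
#define LOG_CMD  0x100u
#define LOG_ADDR 0x200u
struct demo_chip { uint16_t log[8]; size_t n; uint8_t next; int waited; };

static void demo_cmd(void *ctx, uint8_t v)
{
    struct demo_chip *c = ctx;
    if (c->n < 8) c->log[c->n++] = (uint16_t)(LOG_CMD | v);
}
static void demo_addr(void *ctx, uint8_t v)
{
    struct demo_chip *c = ctx;
    if (c->n < 8) c->log[c->n++] = (uint16_t)(LOG_ADDR | v);
}
static void demo_wait(void *ctx)
{
    ((struct demo_chip *)ctx)->waited++;
}
static uint8_t demo_read(void *ctx)
{
    struct demo_chip *c = ctx;
    return c->next++;                  /* dummy data: 0, 1, 2, ... */
}
```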
However, not all chips are quite the same. The ONFI initiative is an attempt
to standardise chip protocols and most new chips should comply with it. A
number of chips on the market are _nearly_ ONFI-compliant: deviations
typically occur over the format of the ReadID response and that of an
address. I believe that older chips did their own thing entirely.
Most, if not all, NAND chips have the same broad electrical interface.
There is a master Chip Enable line; nothing happens if this is not active.
Data flows into and out of the chip via its data bus, which is 8 or 16 bits
wide, mediated by Read Enable and Write Enable lines.
Commands and addresses are sent on the data bus, but routed to the
appropriate latches by asserting the Address Latch Enable or Command Latch
Enable lines at the same time.
There is also a ready/busy line which the driver can use to tell when an
operation is in progress. Typical operation times from the Samsung spec
sheet I have to hand are 25us for a page read, 300us for a page program, and
2ms for a block erase.
(iv) Board hook-up
What's more interesting is how the lines are hooked up to the board.
It is quite commonplace for a board based on a SoC to make good use of an
onboard memory controller or dedicated NAND controller. This allows the
controller to be programmed with the electrical profile the chip expects,
which makes life easy for the device driver: often, you just have to write
bytes to the relevant MMIO register address as fast as you wish and the
controller takes care of the rest.
If the NAND lines are connected to the CPU only as GPIO, the driver has a
lot of work to do in conforming to the correct signal profile at every step
of the chip protocol. (I haven't had to produce such a port, and I don't
think Rutger has needed one either, though he has produced an untested
In the case of a dedicated NAND controller, it is common to provide
hardware-assistance for ECC calculation. Where available, this provides a
significant speed-up (about 40% per page in my benchmarking).
Sometimes the ready/busy line isn't wired in or requires a jumper to be set
to route it. This can be worked around: for a read operation, one can just
insert a delay loop for the prescribed maximum time, while for programs and
erases, most (all?) chips have a "Read Status" command which can be used to
query whether the operation has completed.
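
A minimal sketch of the "Read Status" workaround for programs and erases. The command value (0x70) and the status bits (0x40 ready, 0x01 fail) are the ones commonly documented for ONFI-style chips, but treat them, the bus hooks and the stand-in chip as assumptions to be checked against the data sheet.

```c
#include <stdint.h>

#define NAND_CMD_READ_STATUS 0x70
#define NAND_STATUS_FAIL     0x01   /* last program/erase failed */
#define NAND_STATUS_READY    0x40   /* chip is ready for a new command */

/* Hypothetical hooks; ctx is whatever the board port needs. */
struct status_bus {
    void    (*write_cmd)(void *ctx, uint8_t cmd);
    uint8_t (*read_data)(void *ctx);
    void    *ctx;
};

/*
 * Poll the status register instead of the ready/busy line.
 * Returns 0 on success, -1 if the chip reported a failure,
 * -2 if it was still busy after max_polls reads.
 */
static int nand_wait_status(const struct status_bus *bus, unsigned max_polls)
{
    bus->write_cmd(bus->ctx, NAND_CMD_READ_STATUS);
    while (max_polls-- > 0) {
        uint8_t st = bus->read_data(bus->ctx);
        if (st & NAND_STATUS_READY)
            return (st & NAND_STATUS_FAIL) ? -1 : 0;
    }
    return -2;
}

/* Off-target stand-in: reports busy a few times, then a final status. */
struct demo_chip { unsigned busy_polls; uint8_t final_status; };

static void demo_cmd(void *ctx, uint8_t cmd)
{
    (void)ctx;
    (void)cmd;
}
static uint8_t demo_read(void *ctx)
{
    struct demo_chip *c = ctx;
    if (c->busy_polls > 0) {
        c->busy_polls--;
        return 0x00;
    }
    return c->final_status;
}
```

On real hardware the poll limit would be derived from the data sheet's maximum operation time rather than being a bare count.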
It can be beneficial to be able to set up the ready/busy line as an
interrupt source, as opposed to having to poll it. Whilst there is an
overhead involved in context-switching, if other application threads have
much to do it may be advantageous overall for the thread waiting for the
NAND to sleep until woken by interrupt.
Of course, it is possible to put multiple chips on a board. In that case
there needs to be a way to route between them; I would expect this to be
done with the Chip Select line, addressed either by different MMIO addresses
or a separate GPIO or CPLD step. Theoretically, multiple chips could be
hooked up in parallel to give something that looks like a 16 or 32-bit
"wide" chip, but I have never encountered this in the NAND world, and it
would impose a certain extra level of complexity on the driver.
2. Application interface -----------------------------------------------
Both layers have broadly similar application interfaces.
In both layers, an application must first use a `lookup' call which provides
a pointer to a device context struct. In Rutger's layer, devices are
identified by device number; in eCosCentric's, by a textual name set in the
Both layers provide a means of finding out about the device. R's provides
a call which returns an info block; E's provides macros which retrieve
information from the device struct (which may also be queried directly).
The basic operations required are reading a page, programming a page and
erasing a block, and both layers provide these.
The page-oriented operations optionally allow read/write of the page spare
area. These operations also automatically calculate and check an ECC, if the
device has been configured to do so. Rutger's layer has an extra hook in
place where an application may explicitly request the use of cached reading
and writing where the device supports this.
Both layers also support the necessary ancillary operations of querying the
status of a block in the bad-block table, and marking a block as bad.
E's application interface also provides logic implementing partitions.
That is to say, all access to a NAND array must be via a `partition';
the NAND layer sanity-checks whether the requested flash page or block
address is within the given partition. This is quite a lightweight
layer and hasn't added much overhead of either code footprint or
The presence of partitions in E's model was controversial, as are its
fine details. Nevertheless, some notion of partitioning turns out to be
essential on some boards. In some recent work for a customer we identified
three separate regions of NAND: somewhere to put the boot loader (primary,
as booted by ROM, and RedBoot), somewhere for the application image itself
(perhaps FIS-like rather than a full filesystem), and a filesystem for the
application to use as it pleases.
R's interface does not have such a facility. It appears that, in the event
that the flash is shared between two or more logical regions, it's up to
higher-level code to be configured with the correct block ranges to use.
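
To give a feel for how light such a check can be, here is a sketch of a partition bounds test. The struct, its fields and the function names are invented for illustration and are not eCosCentric's API; the sketch also assumes page and block numbers are device-absolute.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical partition descriptor. */
struct nand_partition {
    uint32_t first_block;       /* inclusive */
    uint32_t last_block;        /* inclusive */
    uint32_t pages_per_block;
};

/* Is this (device-absolute) block inside the partition? */
static bool part_block_ok(const struct nand_partition *p, uint32_t block)
{
    return block >= p->first_block && block <= p->last_block;
}

/* Is this (device-absolute) page inside the partition? */
static bool part_page_ok(const struct nand_partition *p, uint32_t page)
{
    return part_block_ok(p, page / p->pages_per_block);
}
```

Without such a layer, the same two comparisons have to live in whatever higher-level code knows the block ranges.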
(b) Dynamic memory allocation
R's layer mandates the provision of malloc and free, or compatible
functions. These must be provided to the cyg_nand_init() call.
E's doesn't; instead it declares a small number of static buffers.
Andrew Lunn opined on 6/3/09 that R's requirement for malloc is not a major
issue because the memory needs of that layer are well-bounded; I think I
broadly agree, though the situation is not ideal in that it forces somebody
who wants to use a lean, mean eCos configuration to work around it.
Also note that if you're going to run a full file system like YAFFS, you
can't avoid needing malloc, but in an application making simpler use of
NAND, it's an overhead that you may prefer to avoid.
3. Driver model --------------------------------------------------------
The major architectural difference between the two NAND layers is in their
driver models and the degree of abstraction enforced.
In Rutger's layer, controllers and chips are both formally abstracted. The
application talks to the Abstract NAND Chip, which has (hard-coded) the
basic sequences of commands, addresses and data required to talk to a NAND
chip. This layer talks to a controller driver, which provides the nuts and
bolts of reading and writing to the device. The chip driver is also called
by the ANC layer, and provides the really chip-specific parts.
The call flow looks something like this (best viewed in fixed-width font):
Application --(H)-> ANC --(L)-> Controller driver
                        \-(C)-> Chip driver
H: high-level interface (read page, program page, erase block; chip
L: low-level interface (read/write commands, addresses, data; query the busy
C: chip-specific details (chip init, parse ReadID, query factory-bad marker)
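
As a rough sketch of that split, with every name invented for illustration (none of this is taken from Rutger's code): the abstract chip layer owns the command sequence, the controller driver moves the bytes, and the chip driver interprets what comes back. Read ID (command 0x90, address 0x00) is used as the example, and the ID bytes checked are examples only, not a real ID table.

```c
#include <stdint.h>

/* (L) low-level interface, implemented by a controller driver. */
struct ctrl_ops {
    void    (*write_cmd)(void *ctx, uint8_t cmd);
    void    (*write_addr)(void *ctx, uint8_t addr);
    uint8_t (*read_data)(void *ctx);
};

/* (C) chip-specific details, implemented by a chip driver. */
struct chip_ops {
    /* Returns 0 and fills *page_bytes if the ID is recognised, else -1. */
    int (*parse_read_id)(const uint8_t id[4], uint32_t *page_bytes);
};

/* The abstract chip layer binds one of each. */
struct anc_device {
    const struct ctrl_ops *ctrl;
    void                  *ctrl_ctx;
    const struct chip_ops *chip;
    uint32_t               page_bytes;
};

/* The command sequence lives here once, whatever the controller or chip. */
static int anc_init(struct anc_device *dev)
{
    uint8_t id[4];
    dev->ctrl->write_cmd(dev->ctrl_ctx, 0x90);        /* Read ID */
    dev->ctrl->write_addr(dev->ctrl_ctx, 0x00);
    for (int i = 0; i < 4; i++)
        id[i] = dev->ctrl->read_data(dev->ctrl_ctx);
    return dev->chip->parse_read_id(id, &dev->page_bytes);
}

/* Stand-ins so the sketch can be exercised off-target. */
struct demo_ctrl { const uint8_t *id; int next; };

static void demo_cmd(void *ctx, uint8_t cmd)
{
    (void)cmd;
    ((struct demo_ctrl *)ctx)->next = 0;
}
static void demo_addr(void *ctx, uint8_t addr)
{
    (void)ctx;
    (void)addr;
}
static uint8_t demo_read(void *ctx)
{
    struct demo_ctrl *c = ctx;
    return c->id[c->next++];
}
static const struct ctrl_ops demo_ctrl_ops = { demo_cmd, demo_addr, demo_read };

/* Example ID check: bytes 0xEC, 0xF1 taken to mean 2048-byte pages. */
static int demo_parse_id(const uint8_t id[4], uint32_t *page_bytes)
{
    if (id[0] != 0xEC || id[1] != 0xF1)
        return -1;
    *page_bytes = 2048;
    return 0;
}
static const struct chip_ops demo_chip_ops = { demo_parse_id };
```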
In eCosCentric's layer, a NAND driver is a single abstraction covering chip
init and querying the factory-bad status as well as the high level functions
(reading a page, etc). It is left to the driver to determine the sequence of
commands to send. How the driver interacts with the device is considered to
be a contract only between the driver and the relevant platform HAL, so is
not formally abstracted by the NAND layer.
E's chip drivers are written as .inl files, intended to be included by the
relevant platform HALs by whichever source file provides the required
low-level functions. The lack of a formal abstraction is an attempt to
provide a leaner and meaner experience at runtime: the low-level functions
can be (and indeed are, so far) provided as static inlines.
The flow looks like this:
Application --(H1)-> NAND layer --(H2)-> NAND driver --(L*)-> Platform HAL
H1: high-level calls (read page, program page, erase block)
H2: high-level calls (as H1, plus device init and query factory-bad marker)
L*: low-level calls, like L above but not formally abstracted
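
A minimal sketch of that arrangement, collapsed into one file and with all names invented (this is not eCosCentric's code): the platform side supplies the low-level accessors as static inlines, and the chip driver body, which would live in the .inl file, simply calls them. The "registers" are plain variables standing in for MMIO, and the block erase sequence (0x60, row address, 0xD0) with two row-address bytes is an assumption to check against the data sheet.

```c
#include <stdint.h>

/* --- Platform HAL side: stand-ins for the memory-mapped latches. --- */
static volatile uint8_t plf_cmd_reg;
static volatile uint8_t plf_addr_reg;
static unsigned plf_addr_writes;

static inline void plf_write_cmd(uint8_t cmd)
{
    plf_cmd_reg = cmd;
}

static inline void plf_write_addr(uint8_t addr)
{
    plf_addr_reg = addr;
    plf_addr_writes++;
}

/* --- Chip driver side: what the included .inl body might contain. --- */

/* Start a block erase. No function pointers are involved, so the compiler
 * is free to inline the platform accessors into the driver. */
static int chip_erase_block(uint32_t block, uint32_t pages_per_block)
{
    uint32_t row = block * pages_per_block;   /* first page of the block */

    plf_write_cmd(0x60);
    plf_write_addr((uint8_t)(row & 0xFF));
    plf_write_addr((uint8_t)((row >> 8) & 0xFF));
    plf_write_cmd(0xD0);
    return 0;   /* a real driver would now wait and check the status */
}
```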
Each of the two models has its pros and cons.
- As hinted at above, the static inline model of E's low-level access
functions is expected to turn out to have a lower function call (and,
generally, code size) overhead than R's.
- R's model shares the command sequence logic amongst all chips,
differentiating only between small- and large-page devices. (I do not know
whether this is correct for all current chips, though going forwards it seems
less likely to be an issue as fully-ONFI-compliant chips become the norm.)
If multiple chips of different types are present in a build, E's model
potentially duplicates code (though this could be worked around; also, an
ONFI driver ought to be written).
- A corollary of arguably inconsequential import: R's model forces the synth
driver to emulate an entire NAND chip and its protocol. E's synth doesn't
- E's high-level driver interface makes it harder to add new functions
later, necessitating a change to that API (H2 above). R's does not; the
requisite logic would only need to be added to the ANC. It is not thought
that more than a handful of such changes will ever be required, and it may be
possible to maintain backwards compatibility. (As a case in point, support
for hardware ECC is currently work-in-progress within eCosCentric, and does
require such a change, but now is not the right time to discuss that.)
It would perhaps be interesting to compare the complexities of drivers for
the two models, but it's not readily apparent how we would do that fairly.
Perhaps porting a driver from one NAND layer to the other would be a useful
exercise, and would also allow us to compare code sizes. Any suggestions or
(he says hopefully) volunteers? I've got a lot on my plate this month...
4. Feature/implementation differences ------------------------------------
(I don't consider these to be significant issues; whilst noteworthy, I don't
think they would take much effort to resolve.)
The two layers' documentation differs in depth and layout; the two sets are
difficult for me to compare objectively, and I would suggest that a fresh
pair of eyes compare them.
I can only offer the comment that I documented the E layer bearing in mind
what I considered to be missing from the R layer documentation: it was not
clear how the controller and chip layers inter-related, nor where to start
in creating a driver. (I also had a lot less experience of NAND chips then
than I do now, and what I need to know now is different from what a newbie
(b) Availability of drivers
R provides support for:
- One board: BlackFin EZ-Kit BF548 (which is not in anoncvs?)
- One chip: the ST Micro 0xG chip (large page, x8 and x16 present but
presumably only tested on the x8 chip on the BlackFin board?)
- A synthetic controller/chip package
- A template for a GPIO-based controller (untested, intended as an example only)
I seem to remember rumours of the existence of a driver for a further
chip+board combination, but I haven't seen it.
E provides support for:
- Two boards: Embedded Artists LPC2468 (very well tested); STM3210E (largely
complete, based on work by Simon K; some enhancements planned)
- Two chips: Samsung K9 family (large page, only x8 done so far); ST-Micro
NANDxxxx3A (small page, x8) (based on work by Simon K)
- Synthetic target. This offers more features than R's: bad block injection,
logging, and a GUI interface via the synth I/O auxiliary.
- Further (customer-confidential) board ports.
(c) RedBoot support
E have added some commands for NAND operations and tested on the EA LPC2468
board. (YAFFS support works via the existing RB fileio layer; nothing really
needed to be done.)
(d) Degree of testing
There are presumably differences of coverage here; both E and R assert they
have carried out stress tests. Properly comparing the depth of the two would
be a job for fresh eyes.
- a handful of unit and functional tests of the NAND layer, and a benchmarker
- a number of YAFFS functional tests, one of which includes benchmarking,
and a further severe YAFFS stress test: these indirectly test the NAND
layer. (The latter has been run under the synth driver with bad-block
injection turned on, and has revealed some subtle bugs which we probably
wouldn't otherwise have caught.)
- the ability to run continual test cycles in their test farm
5. Works in progress -----------------------------------------------------
I can of course only comment on eCosCentric's plans, but the following work
is in the pipeline:
* Expansion of the device interface to better allow efficient hardware ECC
support (in progress)
* Hardware ECC for the STM3210E board driver
* Performance tuning of software ECC and of NAND low-level drivers
* Partition addressing: make addressing relative to the start of the
partition, once and for all
* Simple raw NAND "filesystem" for use by RedBoot (see
http://ecos.sourceware.org/ml/ecos-devel/2009-07/msg00004.html et seq; those
are the latest public mails but not the latest version of my thinking, which
I will update in due course)
* More RedBoot NAND utility commands
* Support for booting Linux off NAND and for sharing a (YAFFS) NAND-resident
* Part-page read support (would provide a big speed-up to parts of YAFFS2
inbandTags mode as needed by small-page devices like that on the STM3210E)
Embedded Software Engineer, eCosCentric Limited.
Barnwell House, Barnwell Drive, Cambridge CB5 8UU, UK.
Registered in England no. 4422071. www.ecoscentric.com