This is the mail archive of the ecos-discuss@sourceware.cygnus.com mailing list for the eCos project.



RE: Two TCP/IP stack issues...


On 11-Apr-00 Grant Edwards wrote:
> 
> First, TCP/IP panics
> --------------------
> 
>> My application that's running eCos (ARM7TDMI) and the TCP/IP
>> stack is seeing a panic from the stack after a minute or two
>> under certain conditions.  The panics we're seeing are
>> 
>>     m_copydata: null mbuf in skip
>>     m_copydata: null mbuf
>>     sbdrop
> 
> After a bit of rooting around with gdb, we've determined that
> these panics are the result of an sb struct getting corrupted.
> 
> The TCP/IP routines appear to be using two macros to attempt to
> provide mutex access to the sb struct:
> 
> [from tcpip/v1_0a1/include/sys/socketvar.h]
> 
> /*
>  * Set lock on sockbuf sb; sleep if lock is already held.
>  * Unless SB_NOINTR is set on sockbuf, sleep is interruptible.
>  * Returns error without lock if sleep is interrupted.
>  */
>#define sblock(sb, wf) ((sb)->sb_flags & SB_LOCK ? \
>               (((wf) == M_WAITOK) ? sb_lock(sb) : EWOULDBLOCK) : \
>               ((sb)->sb_flags |= SB_LOCK), 0)
> 
> /* release lock on sockbuf sb */
>#define        sbunlock(sb) { \
>       (sb)->sb_flags &= ~SB_LOCK; \
>       if ((sb)->sb_flags & SB_WANT) { \
>               (sb)->sb_flags &= ~SB_WANT; \
>               wakeup((caddr_t)&(sb)->sb_flags); \
>       } \
> }
> 
> 
> These are used by normal foreground tasks, not DSRs or ISRs,
> right? Context switches can occur in the middle of accesses to
> the sb_flags field, allowing two tasks to access the sb struct
> simultaneously and corrupt it.
> 
> When we set our user-task priority to be the same as the eCos
> network task, then the corruption of the sb structs stopped.
> [We have time-slicing disabled.]
> 
> 
> Q: Don't we need to serialize accesses to sb structs with
>    mutexes?
> 
> 

Ah, the joys of importing old code :-)

As I'm sure you can see, this TCP/IP stack was imported, as directly as
possible, from the current OpenBSD sources.  It was felt that this approach
would lead to a workable system in the least amount of time.  Also, by
using an established Open Source code base, we would gain the experience
of that code base.  A conscious effort was made to minimize any changes
to the original sources so that improvements in the OpenBSD world could
be easily tracked and merged back into the eCos version.

However, as these failures show, legacy code brings its own problems
and concerns.  Some things that "just worked" because
of certain assumptions in one code base will not work in another.  The
problem you have uncovered, while it did not show up in our testing,
is one such problem.  In traditional BSD systems, the kernel was not
truly pre-emptable, i.e. control over a CPU could not be lost while
executing in the kernel unless that particular kernel "thread" blocked.
In a truly pre-emptable system like eCos, this assumption is not valid
and problems like the ones you have encountered can occur.
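
To make the race concrete, here is a minimal host-side sketch in C11.
It is an illustration only, not the proposed fix and not code from the
stack: demo_sockbuf, the three functions and the SB_LOCK value are all
made up for the example.  The first function tests and then sets the
flag in two steps, as sblock() does; the second folds both steps into
one atomic operation.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define SB_LOCK 0x01u   /* illustrative value, not copied from socketvar.h */

/* Hypothetical stand-in for the sb_flags word of a sockbuf. */
typedef struct {
    atomic_uint flags;
} demo_sockbuf;

/* Two-step pattern, as in sblock(): test the flag, then set it.
 * Each step is atomic on its own, but the pair is not. */
static bool try_lock_racy(demo_sockbuf *sb)
{
    if (atomic_load(&sb->flags) & SB_LOCK)
        return false;               /* somebody else holds the lock */
    /* A pre-emption right here lets a second thread pass the same
     * test, so both threads end up believing they own the buffer. */
    atomic_fetch_or(&sb->flags, SB_LOCK);
    return true;
}

/* One-step pattern: the old value is fetched and the bit is set
 * indivisibly, so exactly one caller sees the bit clear. */
static bool try_lock_atomic(demo_sockbuf *sb)
{
    unsigned old = atomic_fetch_or(&sb->flags, SB_LOCK);
    return (old & SB_LOCK) == 0;
}

/* Clear the lock bit; releasing a lock that is not held is a bug. */
static void demo_unlock(demo_sockbuf *sb)
{
    unsigned old = atomic_fetch_and(&sb->flags, ~SB_LOCK);
    assert(old & SB_LOCK);
    (void)old;
}
```

Note that an atomic flag only closes the window between the test and
the set.  It does not give a waiter anywhere to sleep, which is why a
real kernel synchronizer is still needed.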

Indeed, the correct solution will need to use proper kernel synchronizers.
We'll need to look at this more closely to determine a proper solution.

The things to think about are:
 * Do we need a single synchronizer or one per sockbuf?
 * If more than one will be used:
   * how do they get allocated?  part of the sockbuf itself?
   * how are they initialized?  destroyed?
 * Are there other such critical data structures, similarly "protected"?
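
As a starting point for discussion, here is a rough sketch of the
"one per sockbuf, stored in the sockbuf itself" option.  It is written
against POSIX threads so that it can be compiled and tried on a host;
in eCos the natural counterpart would presumably be a cyg_mutex_t with
cyg_mutex_init(), cyg_mutex_lock(), cyg_mutex_unlock() and
cyg_mutex_destroy().  The struct and every demo_ name below are
hypothetical, and the real struct sockbuf has many more fields.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <pthread.h>

/* Hypothetical, trimmed-down socket buffer.  The mutex is a member,
 * so it is allocated and freed together with the buffer. */
struct demo_sockbuf {
    pthread_mutex_t sb_mutex;   /* stand-in for a kernel mutex */
    unsigned long   sb_cc;      /* bytes currently in the buffer */
};

/* Initialize the mutex when the socket buffer is set up. */
static int demo_sb_init(struct demo_sockbuf *sb)
{
    sb->sb_cc = 0;
    return pthread_mutex_init(&sb->sb_mutex, NULL);
}

/* Destroy the mutex when the socket buffer is torn down. */
static int demo_sb_destroy(struct demo_sockbuf *sb)
{
    return pthread_mutex_destroy(&sb->sb_mutex);
}

/* Counterpart of sblock(): sleep for the lock if wait is nonzero,
 * otherwise return EWOULDBLOCK when it is already held. */
static int demo_sblock(struct demo_sockbuf *sb, int wait)
{
    if (wait)
        return pthread_mutex_lock(&sb->sb_mutex);
    return pthread_mutex_trylock(&sb->sb_mutex) == 0 ? 0 : EWOULDBLOCK;
}

/* Counterpart of sbunlock(): the mutex wakes any waiter itself, so
 * the SB_WANT / wakeup() bookkeeping is not needed here. */
static void demo_sbunlock(struct demo_sockbuf *sb)
{
    int rc = pthread_mutex_unlock(&sb->sb_mutex);
    assert(rc == 0);
    (void)rc;
}
```

This leaves open the interruptible-sleep behaviour described in the
sblock() comment (SB_NOINTR), and whether a single stack-wide lock
would be simpler to retrofit.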

> 
> 
> Second, tcp_echo 
> ----------------
> 
>> In trying to duplicate the problem with some of the eCos test
>> programs, I tried to lower the buffersize in tcp_source.  It
>> seems to work fine down to about 100 bytes, but below that
>> starts to fail.  I've tried buffer sizes of 60-70 bytes and
>> after a second or two, the data transfer just stops.  Sometimes
>> I get "setsoftnet" messages on the diagnostic port.
>> 
>> Q: Should buffer sizes of 60-70 bytes work in the tcp_echo
>>    test? I can't see anything in the source that leads me to
>>    believe short buffer sizes should fail.
> 
> When running the tcp_echo program with the default task
> priorities, there are long pauses in IP traffic (1 to 10
> seconds).  Sometimes things clog up long enough that we run out
> of mbufs and panic. If the priorities are changed to
> 
>#define IDLE_THREAD_PRIORITY     CYGPKG_NET_THREAD_PRIORITY+3
>#define LOAD_THREAD_PRIORITY     CYGPKG_NET_THREAD_PRIORITY+1
>#define MAIN_THREAD_PRIORITY     CYGPKG_NET_THREAD_PRIORITY-0
> 
> then data flows smoothly, and things never back up by more than
> one ethernet packet.
> 
> Having the main thread priority at the old, higher level
> (CYGPKG_NET_THREAD_PRIORITY-2) appears to prevent the network
> task from running and processing incoming packets.  
> 
> 
> Q: Why can't the network task run when the main thread is
>    blocked on a read()?
> 

This one may or may not be related; it's not immediately clear.

I suggest that we solve problem #1 and see where that leads.

> 
> 
> I don't know if these two situations are related or not, but
> I'd like to take a shot at trying to fix them -- any clues
> anybody would care to lend would be appreciated.
> 

I appreciate the care you've taken looking into these problems.  Remember
that this is "beta" code - we ourselves are sure that bugs
remain.  That said, a cooperative effort in finding good
solutions seems to be a track that will benefit us all.
