This is the mail archive of the ecos-discuss@sources.redhat.com mailing list for the eCos project.


Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]

Re: Scheduler bug found and fixed



(This is a re-send - we think the original reply was delayed)

andrew.lunn@ascom.ch (Andrew Lunn) writes:

> Hi Nick and others.
> 
> I found a bug in the mlqueue schedular. It goes something like this...
> 
> An alarm function calls cyg_thread_suspend() on the current thread, ie
> the one that was running when the timer ISR went off.
> 
> cyg_thread_suspend correctly takes the thread off the run_queue[pri]
> and clears the queue_map bit. The alarm func finishes, all the
> remaining DSR are called. 
> 
> After this it decided to do a timeslice on the clock tick. The
> timeslice function then rotates the current run queue. Unfortunatly,
> the run queue is now empty, so it follows a null pointer into random
> memory and makes that the head of the queue. It then forces a
> reschedule. This works becasue the queue_map bit is not set on the now
> corrupt run_queue. The correct thread gets to run and the system seems
> happy.
> 
> Then something causes an add_thread onto the now corrupt run_queue. In
> my cause a call the cyg_thread_kill(). add_thread checks the
> consitancy of the run_queue. It finds the queue is not empty, but the
> queue_map bit is not set and throws a CYG_ASSERT.
> 
> The fix is to change the timeslice function to check the queue it
> wants to rotate actually has at least one thread on it. If not ask for
> a reschedule.
> 
> Here is a patch for the fix and to put an assert into the rotate
> function to check for rotating and empty queue.
> 

Thanks Andrew, I don't envy you the effort it took to find and fix
this.

I have actually applied a slightly different solution to yours. This
bug only occured when it became necessary to do the rotate in timeslice()
rather than call yield(), which had become unsuitable for use
here. The reason this problem never occured when yield() was being
called is because it has a test for the current thread being in
running state before it rotates the queue. So I've moved this test
over as well. I've still added the assert in rotate() however.

This will turn up in the next anoncvs release. Meantime, here is a
patch against the kernel source tree for what I've done:

Index: ChangeLog
===================================================================
RCS file: /cvs/ecc/ecc/kernel/current/ChangeLog,v
retrieving revision 1.393
diff -u -5 -r1.393 ChangeLog
--- ChangeLog	2000/09/13 11:47:31	1.393
+++ ChangeLog	2000/09/25 13:43:57
@@ -1,5 +1,15 @@
+2000-09-25  Nick Garnett  <nickg@cygnus.co.uk>
+
+	* src/sched/mlqueue.cxx:
+	Added test for current thread not runnable in
+	Cyg_Scheduler_Implementation::timeslice().  This is possible if a
+	prior DSR has caused the current thread to be descheduled. Added
+	an assert to Cyg_ThreadQueue_Implementation::rotate() for
+	additional paranoia. (This problem was originally identified and
+	fixed (differently) by Andrew Lunn <andrew.lunn@ascom.ch>.)
+
 2000-09-13  Jesper Skov  <jskov@redhat.com>
 
 	* tests/kexcept1.c (cause_exception): Use separate cause_fpe function.
 	* tests/except1.cxx (cause_exception): Same.
 
Index: src/sched/mlqueue.cxx
===================================================================
RCS file: /cvs/ecc/ecc/kernel/current/src/sched/mlqueue.cxx,v
retrieving revision 1.34
diff -u -5 -r1.34 mlqueue.cxx
--- src/sched/mlqueue.cxx	2000/09/08 16:38:53	1.34
+++ src/sched/mlqueue.cxx	2000/09/25 13:43:57
@@ -274,24 +274,30 @@
 #endif
 
         CYG_ASSERT( sched_lock > 0 , "Timeslice called with zero sched_lock");
 
         Cyg_Thread *thread = current_thread;
-        Cyg_Scheduler *sched = &Cyg_Scheduler::scheduler;
 
-        CYG_ASSERTCLASS( thread, "Bad current thread");
-        CYG_ASSERTCLASS( sched, "Bad scheduler");
+        // Only try to rotate the run queue if the current thread is running.
+        // Otherwise we are going to reschedule anyway.
+        if( thread->get_state() == Cyg_Thread::RUNNING )
+        {
+            Cyg_Scheduler *sched = &Cyg_Scheduler::scheduler;
+
+            CYG_ASSERTCLASS( thread, "Bad current thread");
+            CYG_ASSERTCLASS( sched, "Bad scheduler");
     
-        cyg_priority pri                               = thread->priority;
-        Cyg_SchedulerThreadQueue_Implementation *queue = &sched->run_queue[pri];
+            cyg_priority pri                               = thread->priority;
+            Cyg_SchedulerThreadQueue_Implementation *queue = &sched->run_queue[pri];
 
-        queue->rotate();
+            queue->rotate();
 
-        if( queue->highpri() != thread )
-            sched->need_reschedule = true;
+            if( queue->highpri() != thread )
+                sched->need_reschedule = true;
 
-        timeslice_count = CYGNUM_KERNEL_SCHED_TIMESLICE_TICKS;
+            timeslice_count = CYGNUM_KERNEL_SCHED_TIMESLICE_TICKS;
+        }
     }
 
     
     CYG_ASSERT( queue_map & (1<<CYG_THREAD_MIN_PRIORITY), "Idle thread vanished!!!");
     CYG_ASSERT( !run_queue[CYG_THREAD_MIN_PRIORITY].empty(), "Idle thread vanished!!!");
@@ -654,11 +660,13 @@
 
 void
 Cyg_ThreadQueue_Implementation::rotate(void)
 {
     CYG_REPORT_FUNCTION();
-        
+
+    CYG_ASSERT(queue != 0, "Rotating an empty queue");
+    
     queue = queue->next;
 
     CYG_REPORT_RETURN();
 }
 



-- 
Nick Garnett, eCos Kernel Architect
Red Hat, Cambridge, UK

Index Nav: [Date Index] [Subject Index] [Author Index] [Thread Index]
Message Nav: [Date Prev] [Date Next] [Thread Prev] [Thread Next]