LISTSERV - CLEANACCESS Archives - LISTSERV.MIAMIOH.EDU

CLEANACCESS Archives

March 2006

CLEANACCESS@LISTSERV.MIAMIOH.EDU

Subject:	Re: CASes going down after upgrade to 3.6.1.1
From:	"Quast, Robert (InfoTechServ)" <[log in to unmask]>
Reply To:	Perfigo SecureSmart and CleanMachines Discussion List <[log in to unmask]>
Date:	Tue, 21 Mar 2006 10:02:34 -0500
Content-Type:	text/plain
Parts/Attachments:	text/plain (221 lines)
We had the OS fingerprinting problem several weeks ago (and TAC actually
designated it as a bug during our call). We had performed a clean
(wipe/reinstall/reconfigure) "upgrade" to 3.6.0.1 in January.  It ran
fine for 2 months.  The week before the server started crashing I had
enabled the OS fingerprinting checkboxes.  Here was our call trace when
the kernel panicked.

Feb 22 14:01:34 csmartsrv2 kernel: Unable to handle kernel NULL pointer
dereference at virtual address 00000017
Feb 22 14:01:34 csmartsrv2 kernel:  printing eip:
Feb 22 14:01:34 csmartsrv2 kernel: f8dd6e78
Feb 22 14:01:34 csmartsrv2 kernel: *pde = 00000000
Feb 22 14:01:34 csmartsrv2 kernel: Oops: 0000 [#1]
Feb 22 14:01:34 csmartsrv2 kernel: Modules linked in: click proclikefs
des blowfish cast5 serpent twofish aes_i586 ipsec autofs4 sunrpc kick
video button battery ac uhci_hcd ehci_hcd shpchp i2c_i801 i2c_core e1000
floppy ata_piix libata sd_mod scsi_mod
Feb 22 14:01:34 csmartsrv2 kernel: CPU:    0
Feb 22 14:01:34 csmartsrv2 kernel: EIP:    0060:[<f8dd6e78>]    Not
tainted VLI
Feb 22 14:01:34 csmartsrv2 kernel: EFLAGS: 00010202   (2.6.11-perfigo)
Feb 22 14:01:34 csmartsrv2 kernel: EIP is at
_ZN10OsDetector5matchEthht9IPAddress12EtherAddresshPhthjhjh9TimestampP8os_entry+0x238/0x5f0 [click]
Feb 22 14:01:34 csmartsrv2 kernel: eax: 00000001   ebx: 00000000   ecx:
00000000   edx: 00000002
Feb 22 14:01:34 csmartsrv2 kernel: esi: 00000000   edi: 00000034   ebp:
00002000   esp: f7025d88
Feb 22 14:01:34 csmartsrv2 kernel: ds: 007b   es: 007b   ss: 0068
Feb 22 14:01:34 csmartsrv2 kernel: Process kclick (pid: 2887,
threadinfo=f7024000 task=f7fef020)
Feb 22 14:01:34 csmartsrv2 kernel: Stack: 00000001 f395e880 00000008
00000000 f35d0000 00000000 00000000 00000000
Feb 22 14:01:34 csmartsrv2 kernel:        002b030a 00000000 08e24fc8
000005b4 f7025ecc 063a030a fe18030a 01804fc8
Feb 22 14:01:34 csmartsrv2 kernel:        c238f380 0000037c 00000005
000002b4 00000005 00000000 00000000 00000000
Feb 22 14:01:34 csmartsrv2 kernel: Call Trace:
Feb 22 14:01:34 csmartsrv2 kernel:  [<f8dd8dad>]
_ZN10OsDetector5parseEPK6Packet+0x3fd/0x6d0 [click]
Feb 22 14:01:34 csmartsrv2 kernel:  [<f8dc7d96>]
_ZN13IPFilterGroup4pushEiP6Packet+0x106/0x120 [click]
Feb 22 14:01:34 csmartsrv2 kernel:  [<f8dbee0d>]
_ZN13HashIPLookup54pushEiP6Packet+0xbd/0x140 [click]
Feb 22 14:01:34 csmartsrv2 kernel:  [<f8dd909e>]
_ZN10OsDetector4pushEiP6Packet+0x1e/0x40 [click]
Feb 22 14:01:34 csmartsrv2 kernel:  [<f8d89d30>]
_ZN8IPFilter4pushEiP6Packet+0x70/0x100 [click]
Feb 22 14:01:34 csmartsrv2 kernel:  [<f8dc15bb>]
_ZN12HashIPTable24pushEiP6Packet+0x6b/0xe0 [click]
Feb 22 14:01:34 csmartsrv2 kernel:  [<f8d89d30>]
_ZN8IPFilter4pushEiP6Packet+0x70/0x100 [click]
Feb 22 14:01:34 csmartsrv2 kernel:  [<f8dabf6a>]
_ZN11ARPQuerier34pushEiP6Packet+0xea/0x170 [click]
Feb 22 14:01:34 csmartsrv2 kernel:  [<f8ddc83c>]
_ZN12RoamingAgent4pushEiP6Packet+0x8c/0x1b0 [click]
Feb 22 14:01:34 csmartsrv2 kernel:  [<f8de10b5>]
_ZN10Classifier4pushEiP6Packet+0x75/0xe0 [click]
Feb 22 14:01:34 csmartsrv2 kernel:  [<f8da07b6>]
_ZN10FromDevice8run_taskEv+0x116/0x210 [click]
Feb 22 14:01:34 csmartsrv2 kernel:  [<f8d687ff>]
_ZN12RouterThread6driverEv+0x1af/0x2b0 [click]
Feb 22 14:01:34 csmartsrv2 kernel:  [<f8df966f>]
_Z11click_schedPv+0x6f/0x120 [click]
Feb 22 14:01:34 csmartsrv2 kernel:  [<f8df9600>]
_Z11click_schedPv+0x0/0x120 [click]
Feb 22 14:01:34 csmartsrv2 kernel:  [<c01012c5>]
kernel_thread_helper+0x5/0x10
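
For readers unfamiliar with the mangled names in the trace above, here is a
minimal Python sketch (not part of the original message) that pulls the
call-trace frames out of an email-wrapped oops and recovers readable names.
The helper names `parse_frames` and `qualified_name` are hypothetical, and the
decoder only handles simple length-prefixed C++ names like the ones in this
trace. A real demangler such as `c++filt` is the better tool when available.

```python
import re
from collections import namedtuple

# One call-trace entry: return address, readable name, offset into the
# function, function size, and the kernel module (None for core kernel code).
Frame = namedtuple("Frame", "address name offset size module")

# Matches "[<addr>] symbol+0xOFF/0xSIZE [module]". The \s+ after the address
# lets the match span the line break that the mail client inserted.
FRAME_RE = re.compile(
    r"\[<([0-9a-f]+)>\]\s+([A-Za-z_]\w*)\+0x([0-9a-f]+)/0x([0-9a-f]+)"
    r"(?:[ \t]+\[(\w+)\])?"
)


def qualified_name(symbol):
    """Best-effort decode of a simple mangled C++ name.

    Assumption: only plain length-prefixed identifiers are handled
    (e.g. _ZN10OsDetector5parseE... -> OsDetector::parse). Templates,
    substitutions and argument types are ignored.
    """
    if not symbol.startswith("_Z"):
        return symbol  # plain C symbol, nothing to decode
    nested = symbol.startswith("_ZN")
    i = 3 if nested else 2
    parts = []
    while i < len(symbol) and symbol[i].isdigit():
        j = i
        while j < len(symbol) and symbol[j].isdigit():
            j += 1
        length = int(symbol[i:j])
        parts.append(symbol[j:j + length])
        i = j + length
        if not nested:
            break  # a non-nested name has a single component
    return "::".join(parts) if parts else symbol


def parse_frames(text):
    """Return the call-trace frames found in text, in order of appearance."""
    frames = []
    for m in FRAME_RE.finditer(text):
        address, symbol, offset, size, module = m.groups()
        frames.append(Frame(int(address, 16), qualified_name(symbol),
                            int(offset, 16), int(size, 16), module))
    return frames
```

Feeding the log text above to `parse_frames` yields one `Frame` per
`[<address>]` entry, which makes it easier to see that the faulting path runs
through the `OsDetector` element of the `click` module.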

Other than this problem the "upgrade" went smoothly and we haven't had
any other problems.  We're waiting till the summer to implement the
router blocking policy.

Rob Quast
Central Connecticut State University
Information Technology Services
Technical Services
[log in to unmask]



-----Original Message-----
From: Perfigo SecureSmart and CleanMachines Discussion List
[mailto:[log in to unmask]] On Behalf Of Jason Richardson
Sent: Monday, March 20, 2006 7:34 PM
To: [log in to unmask]
Subject: Re: [PERFIGO] CASes going down after upgrade to 3.6.1.1

No apology necessary.  As others have said, with you following this
listserv so closely, this is the single most useful resource that we
have for CCA support issues.   I included everything that my associate
wrote down from the console when he noticed the kernel panic.  If we see
it again, we'll get everything.  I'll look for 3.6.2 tomorrow and we'll
look forward to applying the patch and re-enabling OS fingerprinting. We
were looking through our logs after disabling it, and we had well over a
hundred users who were being caught by it, trying to use the user-agent
work-around.

Thanks,

Jason

>>> [log in to unmask] 03/20/06 6:07 PM >>>
Jason,

Sorry for the inconvenience. 
TAC cannot be faulted for this since they may have reached that
conclusion based on the symptom, which is that the machine (CAS) stops
communicating.

Actually, I can be certain of the issue if you send me the first few
lines following the kernel panic.  As I mentioned in my previous email,
I really need to see the messages to confirm one way or the other.  

If it is indeed the issue I was referring to, then the answer to your
question is yes (both those options should be turned off and the CAS
restarted - service perfigo restart).  

And I do understand that this is a very desired feature.  We have fixed
the issue and we included it in 3.6.2 which is scheduled to be released
this evening.  I understand that it would be another minor upgrade for
you (and would involve either a UI upgrade or command line based
upgrade).  However, you would be able to use the OS fingerprinting
feature once 3.6.2 is applied. 

Once again, sorry for the inconvenience. 

-Rajesh.

-----Original Message-----
From: Perfigo SecureSmart and CleanMachines Discussion List
[mailto:[log in to unmask]] On Behalf Of Jason Richardson
Sent: Monday, March 20, 2006 3:03 PM
To: [log in to unmask]
Subject: Re: CASes going down after upgrade to 3.6.1.1

Hi Raj, I really appreciate the quick reply.  Unfortunately, we spent
two hours on the phone with TAC and this never came up.  What they had
us do was a firmware upgrade of the BCOM NICs in the servers (we're
running MCS-7825-H1's with the BCOM 5702X NIC) which we just completed.

We have not upgraded the firmware in the other two CASes or the CAMs yet
although they all have the same BCOM NIC.  When you say disable "OS
fingerprinting" do you mean uncheck both of the boxes - "Set client OS
to WINDOWS_ALL when Win32 platform is detected" and "Set Client OS to
WINDOWS_ALL when Windows TCP/IP stack is detected (Best Effort Match)?"

This is a real bummer since this is the feature that we were looking
forward to implementing the most.

Thanks,

Jason

---
Jason Richardson
Manager, Security Systems
Enterprise Systems Support
Northern Illinois University

>>> [log in to unmask] 3/20/2006 4:21:38 PM >>>
Jason,

We have recently discovered an issue with the OS fingerprinting feature
that can cause a kernel panic (machine hanging).  This issue is fixed in
3.6.2 which should be released late tonight.  

To see if this is the issue affecting your machines, please turn off the
OS detection feature on the machine that is crashing and see if that
"fixes" the problem.  If that is the case, then I would recommend that
the OS fingerprinting feature be turned off until 3.6.2 is applied.
Note that this only happens in certain situations where there is a
client that deliberately sends certain null headers/mismatched TCP
headers.  Of course, when 3.6.2 is applied, you can turn the feature
back on. 

Jason, could you send the messages (you can send them to me offline)
that appear on the console at the time of kernel panic?  That will help
better identify the root cause.

-Rajesh.

-----Original Message-----
From: Perfigo SecureSmart and CleanMachines Discussion List
[mailto:[log in to unmask]] On Behalf Of Jason Richardson
Sent: Monday, March 20, 2006 1:22 PM
To: [log in to unmask]
Subject: CASes going down after upgrade to 3.6.1.1

Hi all, has anyone else had problems with their CASes after upgrading to
3.6.x?  Our 2 CAMs and 4 CASes were running fine after our upgrade last
week, but the students weren't back from break yet.  We came in this
morning to a trouble ticket from students reporting that they could not
log in.  Upon investigation we found one CAS totally unresponsive -
disconnected from the CAM and wouldn't respond to a ping or SSH.  We
literally had to power cycle it to get it back and that seemed to
resolve the problem.  This afternoon we got another trouble ticket
reporting the same problem and found another CAS in the same state with
a message on the console of "kernel panic - not syncing, fatal exception
in interrupt."  The interesting thing about the second one is that when
we did the upgrade it took almost 2x as long to install the OS as the
others and 2x as long to reboot after we applied the 3.6.1.1 patch.  The
first one that went down installed just fine, but also took a long time
to come back after applying the .1 patch.

We're on the phone with TAC now, but we were just wondering whether
anyone else had had similar problems.

Simon, to answer your question, we completed the entire upgrade of 2
CAMs and 4 CASes in under three hours and we considered it a total success
until today.

Thanks,

---
Jason Richardson
Manager, Security Systems
Enterprise Systems Support
Northern Illinois University