
20010208: packet sniffing sheds a little light (cont.)



>From: Leigh Orf <address@hidden>
>Organization: UNCA
>Keywords: 200102052335.f15NZkX27276 McIDAS-XCD

Leigh,

I just wanted to quickly touch base with you to let you know that I
am still hammering away at the sounding serving problem we have run
into on your system.  I have a request in at another site to give
me login access for some tests.  The reason for this is that they
are running RedHat 5.2.  It will be interesting to see if the failure
with sounding serving started with RedHat 6.x (i.e., glibc 2.1).

>Some more puzzle pieces here. I ran a packet sniffer on my home machine
>while viewing a skewt. I was lucky enough to capture a successful
>one from home with storm2 (in the morning when the network isn't
>saturated) along with three unsuccessful ones. I used a very cool program
>called ethereal. I used version 0.8.14 which was compiled with GTK+
>1.2.8, libpcap 0.4 and libz 1.1.3 if you want to view this stuff for
>yourself. See http://ethereal.zing.org for info if you want to build
>this (rpms are available for Linux).

Thanks for the reference.

>I have created a /tmp/mcidas directory on storm2 and put the strace
>files in there. There is a subdirectory called ethereal with a capture
>file of all the traffic, which can be opened by ethereal, and two binary
>stream dumps, one of a successful read and one of a failure.

OK.

>Here's a summary of what I found in case you don't want to muck with
>this yourself.
>
>Each ADDE skewT load happens in two parts. I will use an example from
>my machine. A connection is made between local (home machine) port 1069
>and remote (storm2) port 500 (mcserv). This stream is *always* successfully
>completed. Some text strings from this transfer include:
>
>MDFH
>USER
>MDFH
>RTPTSRC UPPERMAND POS=0 MAX=1 BPOS=1 EPOS=9999 VERSION=1                      
>                                           
>IRAB
>Mand. Level RAOB for 05 FEB 2001
>USER
>MDXX0016
>IRAB
>Mand. Level RAOB for 06 FEB 2001
>USER
>MDXX0017
>IRAB
>Mand. Level RAOB for 07 FEB 2001
>USER
>MDXX0018
>IRAB
>Mand. Level RAOB for 08 FEB 2001
>USER
>MDXX0019
>IRSG
>Sig.  Level RAOB for 05 FEB 2001
>USER
>MDXX0026
>IRSG
>Sig.  Level RAOB for 06 FEB 2001
>USER
>MDXX0027

You can see this (although it is harder) from the strace outputs.
The mandatory level MD files are being scanned to see which one
contains the data that is needed for the sounding.
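
The text fragments quoted above are the kind of thing the `strings`
utility pulls out of a binary stream dump.  As a rough sketch of that
extraction (the sample bytes below are made up for illustration and are
not taken from the actual capture; `printable_strings` is a hypothetical
helper name):

```python
import re
from typing import List


def printable_strings(data: bytes, min_len: int = 4) -> List[str]:
    """Return runs of printable ASCII that are at least min_len bytes long."""
    # 0x20-0x7e covers space through tilde, i.e. printable ASCII.
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return [m.group().decode("ascii") for m in pattern.finditer(data)]


# Hypothetical sample: binary padding around a few text fields.
sample = b"\x00\x00\x01MDXX0016\x00\x02IRAB\x00\xffMand. Level RAOB\x00ab\x00"

for text in printable_strings(sample):
    print(text)
```

Runs shorter than four bytes (like the "ab" above) are dropped, which is
also the default behavior of `strings`.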

>... etc.
>
>Then, a second connection is made between local port 1070 and remote
>mcserv.  This stream is the one that fails. Some ascii text from it (I'm
>just running strings on it):
>
>USER
>RTPTSRC UPPERMAND MAX=10000 SIG='RTPTSRC/UPPERSIG' SELECT= 'IDN 72317'
>'DAY 2001039' 'TIME 0' POS=ALL TRACE=0 VERSION=1 LAT 
>HLAT LON LEV P   T   TD  DIR SPD Z   IDN DAY TIMEZS  ST  CO  HMS MOD
>NREC
>HDEG DEG CHARMB  K   K   DEG MPS M       CYD HMS M   CHARCHARHMS         
>SFC 
>NC  US  
>IRAB
>1000
>NC  US  
>IRAB
>925 
>NC  US  
>IRAB
>850 
 ...

Now, the request is for:

o mandatory level data (IRAB)
o significant level data with respect to temperature (SIGT)
o significant level data with respect to wind (SIGW)

>SIGT
>SIGT
>SIGT
>SIGT
>SIGT
>.
>.
>.
>SIGW
>SIGW
>SIGW
>SIGW

 ...


>The second stream is the one that is truncated.

Right, and it is truncated AFTER transferring all of the mandatory
level and significant wrt Temperature data.  It is failing near the
end of all transfers when doing significant wrt Wind data.

>What is interesting
>is that this stream is usually *almost* entirely read before
>it is truncated.

I noticed this also.

>In fact, for the two stream dumps I put in
>/tmp/mcidas/ethereal (which are of the second stream for each case),
>the difference between the successful and unsuccessful read was only
>four bytes!

Man.  I will have to seriously look through these files.
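
One way to look through two such dumps is to find how long their common
prefix is and whether the shorter one is simply a cut-off copy of the
longer one.  A minimal sketch, assuming both dumps have been read into
memory as bytes (the helper names and the sample data are illustrative
only, and real dumps may also differ in session-specific bytes):

```python
def common_prefix_len(a: bytes, b: bytes) -> int:
    """Return the number of leading bytes that a and b share."""
    limit = min(len(a), len(b))
    for i in range(limit):
        if a[i] != b[i]:
            return i
    return limit


def describe_truncation(good: bytes, bad: bytes) -> str:
    """Summarize how a failed dump differs from a successful one."""
    if good == bad:
        return "identical"
    shared = common_prefix_len(good, bad)
    if shared == len(bad) and len(bad) < len(good):
        missing = len(good) - len(bad)
        return "truncated: %d bytes missing from the end" % missing
    return "streams diverge at byte offset %d" % shared


# Made-up stand-ins for the two stream dumps.
good_dump = b"SIGW" * 10
bad_dump = good_dump[:-4]
print(describe_truncation(good_dump, bad_dump))
```

If the result is "diverge" rather than "truncated", the failing stream
differs in content and not just in length, which would point somewhere
other than an early close.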

>But it's not always this close. If you look at the other
>unsuccessful packet traces, sometimes it only gets about 90% through
>before getting truncated.

I seemed to notice this inconsistency also.  I also noticed that
old_mmap was being called.  When I tried to find an old_mmap entry
point in a library, I was stymied.

>So the most interesting part of what I learned through this is how close
>the second stream gets to being read before getting truncated. Some
>process is closing that pipe *just* before it can complete.  It seems to
>me the bugfix for this might be a one-liner, if you can just find where
>to put it :)

A one liner if it is in the McIDAS code, and a lot of hassle if it is
something in the OS.
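
For what it is worth, here is a small sketch of how a reader can detect
that the other end closed before everything arrived.  The 4-byte length
prefix is a hypothetical framing chosen for the demonstration; it is not
meant to describe the ADDE wire format, and `recv_exact` and `demo` are
made-up names.  It assumes a Unix-like system where
`socket.socketpair()` is available.

```python
import socket
import struct


def recv_exact(sock, nbytes):
    """Read exactly nbytes from sock, or raise EOFError on an early close."""
    chunks = []
    remaining = nbytes
    while remaining > 0:
        chunk = sock.recv(remaining)
        if not chunk:  # empty read means the peer closed the connection
            raise EOFError(
                "peer closed with %d of %d bytes still unread"
                % (remaining, nbytes)
            )
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


def demo():
    """Send a 100-byte payload but close 4 bytes short of the end."""
    reader, writer = socket.socketpair()
    payload = b"x" * 100
    # Announce 100 bytes, then deliberately send only 96 and close.
    writer.sendall(struct.pack("!I", len(payload)) + payload[:-4])
    writer.close()
    try:
        (length,) = struct.unpack("!I", recv_exact(reader, 4))
        recv_exact(reader, length)
        return "complete"
    except EOFError as exc:
        return str(exc)
    finally:
        reader.close()


print(demo())
```

A reader that loops like this can at least report how many bytes were
still expected, which is the kind of number the packet traces above had
to be used to recover.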

>Anyway, that's probably as much debugging as I'm gonna do on this. I figure
>you know the code a lot better than I do. Hope this helps.

This did help, thanks.  I will keep hammering away at this until I
come up with some sort of solution.

Tom

>From address@hidden Tue Feb 13 14:02:31 2001
>Subject: Re: 20010211: XCD GRID decoding, SOUNDINGS HODOGRAPH (cont.) 

Tom, 

>>On the model stuff: Running from Fkey gives this:
>>Command (on one line):
>>      FOUSDISP T OLAY +00 INT=00:30 DAY=2001042 MODE=X GRIDF=132 
>>      GRA=12 SF=YES PRO=CONF 
>>       OUT=PLO GU=GRAPHIC BLANK=NO               
>>Error:
>>      pipe read: Connection reset by peer
>
>Yikes!!!!  The pipe read error is the same one I am running into when
>trying to serve sounding data from Linux.

Uhhh...I don't think this is your error.  This is a semi-normal
nonsensical message from some programs using the socket interface.

After the initial connect, accept() hands the server a new socket for
that connection and a child is usually spawned to service it, which
frees the listening socket for the next incoming connection request.
Otherwise only one program could ever connect to a port number.

Some client programs interpret this as an error and report this as a 
serious sounding message.  In fact, the connection was maintained 
throughout the process by the socket interface.  Once accept() returns,
the server usually spawns a child to handle the connection. For some
brief instant of time, there are actually two active sockets. The remote
clients often interpret this hand off as an event and kick out the above
message.  It isn't an error...normal Unix multitasking stuff at the socket
interface is just being reported.
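
A small sketch of the accept() hand-off, to make the mechanics concrete.
This is only an illustration on the loopback interface with a thread
standing in for the server process; the function names are made up and
nothing here is taken from the McIDAS or mcserv code.  One detail worth
checking against the description above: in the Berkeley sockets API the
socket returned by accept() keeps the listener's local port, and
connections are told apart by the remote address and port.

```python
import socket
import threading


def serve_two(listener, results):
    """Accept two connections and record the ports seen on each."""
    for _ in range(2):
        conn, peer = listener.accept()  # new socket per connection
        local_port = conn.getsockname()[1]
        results.append((local_port, peer[1]))
        conn.sendall(b".")  # one byte so the client knows it was served
        conn.close()


def demo():
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    listener.listen(5)
    port = listener.getsockname()[1]

    results = []
    server = threading.Thread(target=serve_two, args=(listener, results))
    server.start()

    clients = [socket.create_connection(("127.0.0.1", port)) for _ in range(2)]
    replies = [c.recv(1) for c in clients]
    server.join()
    for c in clients:
        c.close()
    listener.close()
    return port, results, replies


port, results, replies = demo()
for local_port, remote_port in results:
    print("server side port %d <- client port %d" % (local_port, remote_port))
```

Both accepted sockets report the same server-side port as the listener,
while the two client ports differ, so the listening socket stays free for
the next request without any port change.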

But it does indicate that the network interface is being used.  Of course,
this is normal even on the same Unix machine.  (Sockets are easy to use
for IPC, much easier than the shared memory, message queues and semaphores
of Sys V Unix.)

I was under the impression that this was how the ADDE worked?  

Thus, the LOCAL-DATA in the LOCDATA.BAT is redundant for sessions
originating on the same machine.  You should be able to substitute the
hostname for LOCAL-DATA.  When you do, you will see this same message on
the local machine.
   
>
>>      FOUSDISP: Unable to execute PTDISP       command
>>      FOUSDISP - Done
>>      FOUSDISP failed, RC=1
>
I think this is the real error. Looks like a script error or a bad call,
or just plain bad communications between two programs:
        FOUSDISP and PTDISP
I am assuming that FOUSDISP is a wrapper around PTDISP.

>I just tried serving FOUS14 data from the same server at UNCA that I
>am having problems with (sounding serving), and it serves FOUS14 data
>just fine.  FOUSDISP plots FOUS14 data which, even though it is data
>from a model, is stored in a McIDAS MD file and is part of the RTPTSRC
>dataset.  I wonder what the pipe read error you are seeing is telling
>us about ADDE service on Linux!?
>
>Tom
Absolutely.
Hmmm...OK. This would be consistent with what I am seeing. There are several
flags to the socket() call when the sockets are instantiated.  These are
macros in the /usr/include files....uhhh, socket.h, I think.
(Well, close.../usr/include/sys/socket.h) Red Hat does some weird stuff 
with their Protected Interface stuff...These are pi_socket.h and others.

ADDE has been pretty workable in the past, so I suspect the
RH people of taking liberties with the include files.
There are really only two major kinds of socket interfaces:
        AF_UNIX (or AF_LOCAL)
        AF_INET
        A local interface and a remote interface basically. (Pretty 
much corresponding to the two types of connections in LOCDATA.BAT.)
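
To illustrate the two address families, here is a rough sketch that
pushes the same message through an AF_UNIX socket pair and through an
AF_INET connection on the loopback interface.  It assumes a Unix-like
system (AF_UNIX is not generally available elsewhere), the helper names
are made up, and it says nothing about which family ADDE itself uses for
local data.

```python
import socket


def read_all(sock):
    """Read from sock until the peer signals end of stream."""
    chunks = []
    while True:
        chunk = sock.recv(4096)
        if not chunk:
            return b"".join(chunks)
        chunks.append(chunk)


def roundtrip_unix(msg):
    """Send msg over a local (AF_UNIX) stream socket pair."""
    a, b = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    with a, b:
        a.sendall(msg)
        a.shutdown(socket.SHUT_WR)  # signal end of stream to the reader
        return read_all(b)


def roundtrip_inet(msg):
    """Send msg over a TCP (AF_INET) connection on the loopback interface."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as listener:
        listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a port
        listener.listen(1)
        with socket.create_connection(listener.getsockname()) as client:
            conn, _ = listener.accept()
            with conn:
                client.sendall(msg)
                client.shutdown(socket.SHUT_WR)
                return read_all(conn)


message = b"RTPTSRC/UPPERMAND"
print(roundtrip_unix(message))
print(roundtrip_inet(message))
```

The calling code is nearly identical in both cases; only the address
family and the way the endpoints are named differ.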

Of course, you know this stuff, so I'm just thinking out loud. But I
suspect the error is in there.

                        Gotta run, some personal support issues...

If I don't send it now, you won't get it till the morrow.
                                                        jdm