
19990607: archive methodology at U Illinois



>From: David Wojtowicz <address@hidden>
>Organization: University of Illinois
>Keywords: 199905050346.VAA11109 data archive

David,

Thanks for giving me a nice heads-up on how you are saving archive data
at UI.  I will be rolling this over in my mind as we work on the system
in NCAR/SCD.

One thing that both Chiz and I agree on is that model data really should
be saved by model and by runtime.  The other thing that I have done is
save the FSL2 wind profiler data into individual files that sites
can grab.  The main reason for this is that these files are already
in netCDF, so they really don't need to be processed by any decoder.
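For illustration, filing each product into its own file can be done with
a pqact entry along these lines; the pattern, path, and use of the whole
product identifier as the filename (\1) are placeholders, not our actual
entry:

  FSL2    ^(.*)
          FILE    /data/pub/profiler/\1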

Our first shot out of the box will be FTP access to data that can be
as old as 5 days (we don't have enough disk for more yet).  We will
develop a simple web interface where the links are FTP actions.
Eventually, of course, something much grander is called for.  We are
hoping that interested community members will be willing to contribute
to the development effort, since the big benefit is access to the data
on the NCAR Mass Store.

Thanks again for your input.  It was much appreciated.

> I know you've asked this before, but I never got back to
>you because I was working on something better and have
>been thwarted in my attempts by the trouble we're having
>with LDM on Linux; it seems to be especially problematic
>on the new 2.2 kernel.
>
> What we're doing now is so unfancy that you'll be sorry to have
>gone to the trouble of asking.
>
># FEED STORAGE
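># One entry per feed: file every product (^.*) into one file per hour;
># pqact's date substitution fills in %Y%m%d%H (see the caveat below).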
>
>DDPLUS  ^.*
>        FILE    /home/data/feeds/DDPLUS/%Y%m%d%H.DDPLUS
>IDS     ^.*
>        FILE    /home/data/feeds/IDS/%Y%m%d%H.IDS
>MCIDAS  ^.*
>        FILE    /home/data/feeds/MCIDAS/%Y%m%d%H.MCIDAS
>HRS     ^.*
>        FILE    /home/data/feeds/HRS/%Y%m%d%H.HRS
>
>
>I have a simple cron script to compress the files with
>gzip after they're done and remove the old ones.
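A minimal sketch of such a script, for illustration only; the paths, the
two-hour settling delay, and the 14-day retention are placeholder
assumptions, and -mmin requires GNU find:

  #!/bin/sh
  # Compress hourly feed files that are no longer being written to,
  # then remove compressed files older than two weeks.  Run from cron.
  DATADIR=/home/data/feeds

  # compress hour files untouched for more than two hours
  find $DATADIR \( -name '*.DDPLUS' -o -name '*.IDS' \
       -o -name '*.MCIDAS' -o -name '*.HRS' \) \
       -type f -mmin +120 -exec gzip {} \;

  # delete compressed files older than 14 days
  find $DATADIR -name '*.gz' -type f -mtime +14 -exec rm -f {} \;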
>
>The ftp server just accesses data from the DDPLUS, IDS, MCIDAS,
>and HRS directories.
>
>These files (with the exception of the MCIDAS feed) can then
>be fed back into the LDM with the pqing program.
> 
>  gunzip -c 1999060412.DDPLUS.gz |  
>     pqing -f ddplus -p none -n -
>
>This is convenient because the data is reingested into your
>system and put through all the same processing steps that
>real time data goes through, so if you've missed this data
>and are recovering it, everything gets restored just as if
>you had received it in real time.
>
>However, there are some big problems with this approach.
>
>One is that it does not work for data more than a week or two
>old.  The problem lies in the fact that the WMO headers only
>encode the day of the month and the time, but not the month
>or year.  LDM therefore always assumes that the month is
>the current one when performing date substitution in pqact.
>If you try to restore data from two months ago with this
>method, pqact will generate the wrong filenames.  And programs
>that use the data, such as decoders, will be confused too.  (With
>HRS there is still a chance, because the full date is encoded
>in the GRIB headers, but not so with many DDPLUS decoders.)
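To make the limitation concrete: a typical WMO abbreviated heading such as

  SAUS80 KWBC 051200

carries only the day of the month (05) and the time (1200 UTC); nothing in
the heading identifies the month or year, so pqact has to assume them.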
>
>The other problem is that pqing only works with data broken
>into WMO messages, and therefore excludes non-WMO-message
>data like the MCIDAS datastream and other feeds such
>as NLDN, NMC2, DIFAX, WSI, and FSL.
>
>Feeding it back into your product queue can cause problems
>too, including sending the data downstream to sites that
>don't want it and overwhelming your product queue, because
>with this method it can be fed in faster than it can be processed.
>
>
>What I was in the process of proposing was to store the
>data in an alternate format that has built-in delimiters
>so that it could be used with all data types.  This would
>be similar to storing the data via the SPIPE method:
>
>HRS     ^.*
>        SPIPE   cat >/home/data/feeds/HRS/%Y%m%d%H.HRS
>
>which adds a special delimiter between each product that includes
>a key, a product identifier, and the product size.  However,
>I'd expand this to include some additional information,
>including the feedtype and the product date/time.
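One way the expanded per-product header might be laid out; the field names
and layout below are purely illustrative, not an actual LDM format:

  key        unique product key/signature    (already in the SPIPE delimiter)
  ident      product identifier              (already in the SPIPE delimiter)
  size       product length in bytes         (already in the SPIPE delimiter)
  feedtype   e.g. HRS, DDPLUS, MCIDAS        (proposed addition)
  datetime   full product date/time,         (proposed addition)
             e.g. 19990604 1200 UTC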
>
>In this way, the contents of these saved files would be
>sufficiently internally self-describing to be reprocessed
>correctly.
>
>On the other end, there'd be a patch to pqact that would
>allow it to fetch products for processing from this
>archive file rather than the product queue.  It would
>use the delimiter/headers to split the products apart
>and fill in the product info structure that it normally
>gets from the product queue with the correct feedtype,
>date, etc., and therefore be able to process it correctly.
>By avoiding the product queue, it would also be faster,
>because it wouldn't have to write a copy to the queue
>first and you wouldn't have to worry about the queue
>being overwhelmed.
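Restoring from such an archive might then look something like the line
below; pqact_archive is a purely hypothetical name for the proposed patched
program, which does not exist today:

  # feed a delimited, self-describing archive file straight to the
  # (hypothetical) patched pqact, bypassing the product queue
  gunzip -c 1999060412.HRS.gz | pqact_archive pqact.conf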
>
>Anyway, I had started to write up a small white paper describing
>this, but stopped when I ran into trouble getting LDM to work
>properly under Linux.
>
>
>Future ideas included a web interface that would allow you
>to request the archive files, applying a filter to them
>to get only the data you were interested in.
>
>--------------------------------------------------------
> David Wojtowicz, Research Programmer/Systems Manager
> Department of Atmospheric Sciences Computer Services
> University of Illinois at Urbana-Champaign
> email: address@hidden  phone: (217)333-8390
>--------------------------------------------------------

Tom Yoksas