Re: [thredds] THREDDS Data Server serving from Amazon S3


Jeff,

I wrote some prototypes for Hyrax that used Glacier/S3/EBS to manage data. 
It was a proof-of-concept effort, not production ready by any means. It seems 
your idea is very much in line with my own. My thinking was to put EVERYTHING 
in Glacier, spool items from Glacier into S3 as needed, and then copy them 
from S3 into an EBS volume for operational access by the server. The least 
recently accessed content would get purged, first from EBS and then later from 
S3, so that eventually the only copy would be in Glacier, at least until the 
item was accessed again. I think it would be really interesting to work up a 
real system that does this for DAP services of science data!
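
To make the staging and purging idea concrete, here is a minimal sketch in 
Python. It is not the Hyrax prototype code; the class name, the capacity 
limits, and the least-recently-used purge rule are all assumptions made for 
illustration, and no AWS calls are made (the tiers are just in-memory maps).

```python
from dataclasses import dataclass, field

# Tier names, ordered from fastest/most expensive to slowest/cheapest.
EBS, S3, GLACIER = "EBS", "S3", "Glacier"


@dataclass
class TieredStore:
    """Toy model of Glacier -> S3 -> EBS staging (hypothetical, no AWS calls)."""
    ebs_capacity: int  # max items kept on EBS (assumed limit)
    s3_capacity: int   # max items kept in S3 (assumed limit, >= ebs_capacity)
    glacier: set = field(default_factory=set)
    s3: dict = field(default_factory=dict)   # key -> last access tick
    ebs: dict = field(default_factory=dict)  # key -> last access tick
    clock: int = 0

    def archive(self, key):
        """Everything is always kept in Glacier."""
        self.glacier.add(key)

    def access(self, key):
        """Stage key up to EBS; return the (source, target) copies needed."""
        if key not in self.glacier:
            raise KeyError(key)
        self.clock += 1
        steps = []
        if key not in self.s3:
            steps.append((GLACIER, S3))  # slow step: hours in real Glacier
        if key not in self.ebs:
            steps.append((S3, EBS))
        self.s3[key] = self.clock
        self.ebs[key] = self.clock
        # Purge least recently accessed items, first from EBS, then from S3.
        self._purge(self.ebs, self.ebs_capacity)
        self._purge(self.s3, self.s3_capacity)
        return steps

    @staticmethod
    def _purge(tier, capacity):
        while len(tier) > capacity:
            oldest = min(tier, key=tier.get)
            del tier[oldest]

    def tiers(self, key):
        """List the tiers that currently hold a copy of key."""
        found = []
        if key in self.ebs:
            found.append(EBS)
        if key in self.s3:
            found.append(S3)
        if key in self.glacier:
            found.append(GLACIER)
        return found
```

With an EBS capacity of 1 and an S3 capacity of 2, accessing three items in 
turn leaves the oldest one only in Glacier, the middle one in S3 and Glacier, 
and the newest one in all three tiers. A real system would also have to deal 
with the hours-long Glacier retrieval step asynchronously.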

Our experience with S3 filesystems was similar to Roy’s: the various S3 
filesystem solutions that we looked at didn’t really cut it (speed and 
proprietary utilization of S3). But managing S3 isn’t that tough. I skipped 
the filesystem approach and just used the AWS HTTP API for S3, and it was 
quick and easy. Glacier is more difficult: access times are long for 
everything. That includes 4 hours to get an inventory report, even though AWS 
computes the inventory only once every 24 hours. So managing the Glacier 
holdings by keeping local copies of the inventory is important, as is having 
a way to verify that the local inventory stays in sync with the actual 
inventory.
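
The sync check could look something like the sketch below. It assumes each 
inventory has already been loaded into a plain mapping of archive ID to 
checksum; the function name and the mapping layout are hypothetical, and 
fetching the actual inventory report from Glacier is left out.

```python
def reconcile(local, remote):
    """Compare a local inventory with the inventory reported by Glacier.

    Both arguments map archive ID -> checksum (an assumed layout).
    Returns the archive IDs that differ, grouped by kind of difference.
    """
    local_ids, remote_ids = set(local), set(remote)
    return {
        # Reported by Glacier but unknown locally.
        "missing_locally": sorted(remote_ids - local_ids),
        # Known locally but not in the report.
        "missing_remotely": sorted(local_ids - remote_ids),
        # Present in both, but the checksums disagree.
        "checksum_mismatch": sorted(
            k for k in local_ids & remote_ids if local[k] != remote[k]
        ),
    }


def in_sync(local, remote):
    """True when reconcile() finds no differences at all."""
    return not any(reconcile(local, remote).values())
```

Because the Glacier inventory is only recomputed about once a day, an archive 
uploaded since the last inventory run may show up as "missing_remotely" 
without anything being wrong, so a real check would probably want to ignore 
very recent uploads.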


Nathan

> On Jul 14, 2015, at 1:00 PM, Jeff McWhirter <jeff.mcwhirter@xxxxxxxxx> wrote:
> 
> 
> Glacier could be used for storage of all that data that you need to keep 
> around but rarely, if ever, access, e.g., level-0 instrument output, raw 
> model output, etc. If your usage model supports this type of latency, then 
> the cost savings (1/10th) are significant.
> 
> This is where hiding the storage semantics behind a file system breaks down. 
> The application can't be agnostic of the underlying storage, as it needs to 
> support delays in staging data, communicating to the end-user, caching, etc.
> 
> -Jeff
> 
> 
> 
> On Tue, Jul 14, 2015 at 1:35 PM, Robert Casey <rob@xxxxxxxxxxxxxxxxxxx> wrote:
> 
>       Hi Jeff-
> 
>       Of note, Amazon Glacier is meant for infrequently needed data, so a 
> call-up for data from that source will require something on the order of a 5 
> hour wait to retrieve to S3.  I think they are developing a near-line storage 
> solution that is a bit more expensive to compete with Google's new near-line 
> storage, which provides retrieval times on the order of seconds.
> 
>       -Rob
> 
>> On Jul 14, 2015, at 10:10 AM, Jeff McWhirter <jeff.mcwhirter@xxxxxxxxx> 
>> wrote:
>> 
>> On this note -
>> What I really want is a file system that can transparently manage data 
>> between primary (SSD), secondary (S3) and tertiary (Amazon Glacier) stores. 
>> Actively used data would migrate into primary storage. Old archived data 
>> moves off into cheaper tertiary storage. I've thought of implementing this 
>> at the application level in RAMADDA but a file system based approach would 
>> be much smarter.
>> 
>> How do the archive folks on this list manage these kinds of storage 
>> environments?
>> 
>> -Jeff
>> 
>> 
>> 
>> 
>> On Tue, Jul 14, 2015 at 10:44 AM, John Caron <caron@xxxxxxxx> wrote:
>> Hi David:
>> 
>> At the bottom of the TDM, we rely on RandomAccessFile. Do you know if S3 
>> supports that abstraction (essentially POSIX file semantics, e.g., seek(), 
>> read())? My guess is that S3 only allows complete file transfers (?)
>> 
>> Would be worth investigating if anyone has implemented a Java 
>> FileSystemProvider for S3.
>> 
>> Will have a closer look when I get time.
>> 
>> John
>> 
>> On Mon, Jul 13, 2015 at 7:59 PM, David Nahodil <David.Nahodil@xxxxxxxxxxx> 
>> wrote:
>> Hi all,
>> 
>> 
>> We are looking at moving our THREDDS Data Server to Amazon EC2 instances 
>> with the data hosted on S3. I'm just wondering if anyone has tried using TDS 
>> with data hosted on S3?
>> 
>> 
>> I had a quick back-and-forth with Sean at Unidata (see below) about this.
>> 
>> 
>> Regards,
>> 
>> 
>> David
>> 
>> 
>> > > Unfortunately, I do not know of anyone who has done this, although we 
>> > > have had at least one other person ask. From what I understand, there is 
>> > > a way to mount an S3 storage as a virtual file system, in which case I 
>> > > would *think* that the TDS would work as it normally does (depending on 
>> > > the kind of data you have).
>> 
>> > We have considered mounting the S3 storage as a filesystem and running it 
>> > like that. However, our feeling was that the tools were not really 
>> > production ready and that we're really misrepresenting S3 by pretending it 
>> > is a file system. So this is why we're investigating if anyone has used 
>> > TDS with the S3 API directly.
>> 
>> > > What kind of data do you have? Will your TDS also be in the cloud? Do 
>> > > you plan on serving the data inside of amazon to other EC2 instances, or 
>> > > do you plan on crossing the cloud/commodity web boundary with the data, 
>> > > in which case that could get very expensive quite quickly?
>> 
>> > We have about 2 terabytes of marine and climate data that we are currently 
>> > serving from our existing infrastructure. The plan is to move the 
>> > infrastructure to Amazon Web Services so TDS would be hosted on EC2 
>> > machines and the data on S3. We're hoping this setup should work okay, but 
>> > we might still have a hurdle or two to come. :)
>> 
>> > We have someone here who once wrote a plugin/adapter for TDS to work with 
>> > an obscure filesystem that our data used to be stored on. So we have a 
>> > little experience in what might be involved in doing the same with S3. We 
>> > just wanted to make sure that if anyone had done some work already that 
>> > we made use of that.
>> 
>> > > We very, very recently (as in a day ago) got some Amazon resources to 
>> > > play around on, but we won't have a chance to kick those tires until 
>> > > after our training workshops at the end of the month.
>> 
>> 
>> 
>> 
>> _______________________________________________
>> thredds mailing list
>> thredds@xxxxxxxxxxxxxxxx
>> For list information or to unsubscribe,  visit: 
>> http://www.unidata.ucar.edu/mailing_lists/ 

= = =
Nathan Potter                        ndp at opendap.org
OPeNDAP, Inc.                        +1.541.231.3317


