Re: [thredds] THREDDS Data Server serving from Amazon S3

To: David Nahodil <David.Nahodil@xxxxxxxxxxx>
Subject: Re: [thredds] THREDDS Data Server serving from Amazon S3
From: Robert Casey <rob@xxxxxxxxxxxxxxxxxxx>
Date: Wed, 15 Jul 2015 14:00:54 -0700
        All-

        I accidentally ran across a commercial solution that looks like it 
would fit the bill for THREDDS on S3, provided you had the funds to spend on it:

http://panzura.com/products/global-file-system/ 
<http://panzura.com/products/global-file-system/>

http://go.panzura.com/rs/panzura/images/AmazonSolutionBrief.pdf 
<http://go.panzura.com/rs/panzura/images/AmazonSolutionBrief.pdf>



> On Jul 14, 2015, at 11:40 PM, David Nahodil <David.Nahodil@xxxxxxxxxxx> wrote:
> 
> Hi all,
> 
> As you mention mounting S3 as a file system was problematic for a few reasons 
> (including the speed issues) which is why we wanted to look into other 
> options.
> 
> I had come across the work NOAA/OpenDAP work on Hyrax which was interesting, 
> and might still be a fall-back depending on how my investigations go. I 
> hadn't seen that paper though (thanks James) and that was a good summary of 
> the work and findings.
> 
> There are other considerations and techniques which I am still learning about 
> with AWS. It might still be the case that we need to use EBS in conjunction 
> with S3.
> 
> John's point about the file system operations to provide the random access 
> required is an important one. If I understand correctly it, the Hyrax work 
> kept a local cache of files as they were needed. This local caching 
> populating from S3 might be the technique we need to employ. It's a pain to 
> manage resources like that (disk size, cache invalidation, etc.), though, so 
> we'll have to see.
> 
> I made a bit of progress today fleshing-out a CrawlableDataset, I expect to 
> come up against a few more challenges tomorrow.
> 
> Thanks all for the input!
> 
> Cheers,
> 
> David
> 
> 
> 
> 
> 
> ________________________________________
> From: Nathan Potter <gnatman.p@xxxxxxxxx>
> Sent: Wednesday, 15 July 2015 8:17 AM
> To: Jeff McWhirter
> Cc: Nathan Potter; Robert Casey; John Caron; David Nahodil; 
> thredds@xxxxxxxxxxxxxxxx
> Subject: Re: [thredds] THREDDS Data Server serving from Amazon S3
> 
> Jeff,
> 
> I would also add that because of the time costs associated with retrieving 
> from Glacier it becomes crucial to only get what you really want. To that end 
> I believe that such a system can only work well if the granule metadata (and 
> probably any shared dimensions or Map vectors) be cached outside of Glacier 
> so that users can retrieve these information without having to incur the time 
> and fiscal expense of a full Glacier retrieval.
> 
> In DAP parlance the DDS/DAS/DDX/DMR responses should be immediately available 
> for all holdings. And I think that Map vectors/dimension should also be 
> included in this. This would go a long way towards making such a system 
> useful to a savvy client.
> 
> 
> N
> 
>> On Jul 14, 2015, at 2:52 PM, Nathan Potter <ndp@xxxxxxxxxxx> wrote:
>> 
>> 
>> 
>> Jeff,
>> 
>> I wrote some prototypes for Hyrax that utilized Glacier/S3/EBS to manage 
>> data. It was a proof of concept effort - not production ready by any means. 
>> It seems your idea was very much in line with my own. My thinking was to put 
>> EVERYTHING in Glacier and then spool things off Glacier into S3, and then 
>> from there into EBS as needed by the server. Things would get spooled from 
>> Glacier to S3 and then copied into an EBS volume for operational access. 
>> Last accessed content would get purged, first from EBS and then later from 
>> S3 so that eventually the only copy would be in Glacier, at least until the 
>> item was accessed again. I think it would be really interesting to work up a 
>> real system that does this for DAP services of science data!
>> 
>> Our experience with S3 filesystems was similar to Roy’s: The various S3 
>> filesystem solutions that we looked at did’t really cut it (speed & 
>> proprietary utilization of S3). But managing S3 isn’t that tough, I skipped 
>> the filesystem approach and just used the AWS HTTP API for S3 and it was 
>> quick and easy. Glacier is more difficult: Access times are long for 
>> everything. That includes 4 hours to get an inventory report, despite the 
>> fact that the inventory is computed by AWS once every 24 hours. So managing 
>> the Glacier holdings by keeping local copies of the inventory is important 
>> as is having a way to verify that the local inventory stays in sync with the 
>> actual inventory.
>> 
>> 
>> Nathan
>> 
>>> On Jul 14, 2015, at 1:00 PM, Jeff McWhirter <jeff.mcwhirter@xxxxxxxxx> 
>>> wrote:
>>> 
>>> 
>>> Glacier could be used for storage of all that data that you need to keep 
>>> around but rarely if ever access  - e.g., level-0 instrument output, raw 
>>> model output,  etc. If your usage model supports this type of latency then 
>>> the cost savings (1/10th) are significant
>>> 
>>> This is where hiding the storage semantics behind a file system breaks 
>>> down. The application can't be agnostic of the underlying storage as they 
>>> need to support delays in staging data, communicating to the end-user, 
>>> caching, etc.
>>> 
>>> -Jeff
>>> 
>>> 
>>> 
>>> On Tue, Jul 14, 2015 at 1:35 PM, Robert Casey <rob@xxxxxxxxxxxxxxxxxxx> 
>>> wrote:
>>> 
>>>     Hi Jeff-
>>> 
>>>     Of note, Amazon Glacier is meant for infrequently needed data, so a 
>>> call-up for data from that source will require something on the order of a 
>>> 5 hour wait to retrieve to S3.  I think they are developing a near-line 
>>> storage solution that is a bit more expensive to compete with Google's new 
>>> near-line storage, which provides retrieval times on the order of seconds.
>>> 
>>>     -Rob
>>> 
>>>> On Jul 14, 2015, at 10:10 AM, Jeff McWhirter <jeff.mcwhirter@xxxxxxxxx> 
>>>> wrote:
>>>> 
>>>> On this note -
>>>> What I really want is a file system that can transparently manage  data 
>>>> between primary (SSD), secondary (S3) and tertiary (Amazon Glacier)  
>>>> stores.  Actively used data would migrate into primary storage. Old 
>>>> archived data moves off into cheaper tertiary storage. I've thought of 
>>>> implementing this at the application level in RAMADDA but a file system 
>>>> based approach would be much smarter.
>>>> 
>>>> How do the archive folks on this list manage these kinds of storage 
>>>> environments?
>>>> 
>>>> -Jeff
>>>> 
>>>> 
>>>> 
>>>> 
>>>> On Tue, Jul 14, 2015 at 10:44 AM, John Caron <caron@xxxxxxxx> wrote:
>>>> Hi David:
>>>> 
>>>> At the bottom of the TDM, we rely on RandomAccessFile. Do you know if S3 
>>>> supports that abstraction (essentially posix file semantics, eg seek(), 
>>>> read()) ? My guess is that S3 only allows complete file transfers (?)
>>>> 
>>>> Would be worth investigating if anyone has implemented a java 
>>>> FileSystemProvider for S3.
>>>> 
>>>> Will have a closer look when i get time.
>>>> 
>>>> John
>>>> 
>>>> On Mon, Jul 13, 2015 at 7:59 PM, David Nahodil <David.Nahodil@xxxxxxxxxxx> 
>>>> wrote:
>>>> Hi all,
>>>> 
>>>> 
>>>> We are looking at moving our THREDDS Data Server to Amazon EC2 instances 
>>>> with the data hosted on S3. I'm just wondering if anyone has tried using 
>>>> TDS with data hosted on S3?
>>>> 
>>>> 
>>>> I had a quick back-and-forth with Sean at Unidata (see below) about this.
>>>> 
>>>> 
>>>> Regards,
>>>> 
>>>> 
>>>> David
>>>> 
>>>> 
>>>>>> Unfortunately, I do not know of anyone who has done this, although we 
>>>>>> have had at lease one other person ask. From what I understand, there is 
>>>>>> a way to mount an S3 storage as a virtual file system, in which case I 
>>>>>> would *think* that the TDS would work as it normally does (depending on 
>>>>>> the kind of data you have).
>>>> 
>>>>> We have considered mounting the S3 storage as a filesystem and running it 
>>>>> like that. However, our feeling was that the tools were not really 
>>>>> production ready and that we're really misrepresenting S3 by pretending 
>>>>> it is a file system. So this is why we're investigating if anyone has 
>>>>> used TDS with the S3 API directly.
>>>> 
>>>>>> What kind of data do you have? Will your TDS also be in the cloud? Do 
>>>>>> you plan on serving the data inside of amazon to other EC2 instances, or 
>>>>>> do you plan on crossing the cloud/commodity web boundary with the data, 
>>>>>> in which case that could get very expensive quite quickly?
>>>> 
>>>>> We have about 2 terabytes of marine and climate data that we are 
>>>>> currently serving from our existing infrastructure. The plan is to move 
>>>>> the infrastructure to Amazon Web Services so TDS would be hosted on EC2 
>>>>> machines and the data on S3. We're hoping this setup should work okay, 
>>>>> but we might still have a hurdle or two to come. :)
>>>> 
>>>>> We have someone here who once wrote a plugin/adapter for TDS to work with 
>>>>> an obscure filesystem that our data used to be stored on. So we have a 
>>>>> little experience in what might be involved in what might be involved for 
>>>>> doing the same with S3. We just wanted to make sure that if anyone had 
>>>>> done some work already that we made use of that.
>>>> 
>>>>>> We very, very recently (as in a day ago) got some Amazon resources to 
>>>>>> play around on, but we won't have a chance to kick those tires until 
>>>>>> after our training workshops at the end of the month.
>>>> 
>>>> 
>>>> University of Tasmania Electronic Communications Policy (December, 2014).
>>>> This email is confidential, and is for the intended recipient only. 
>>>> Access, disclosure, copying, distribution, or reliance on any of it by 
>>>> anyone outside the intended recipient organisation is prohibited and may 
>>>> be a criminal offence. Please delete if obtained in error and email 
>>>> confirmation to the sender. The views expressed in this email are not 
>>>> necessarily the views of the University of Tasmania, unless clearly 
>>>> intended otherwise.
>>>> 
>>>> 
>>>> _______________________________________________
>>>> thredds mailing list
>>>> thredds@xxxxxxxxxxxxxxxx
>>>> For list information or to unsubscribe,  visit: 
>>>> http://www.unidata.ucar.edu/mailing_lists/
>>>> 
>>>> 
>>>> _______________________________________________
>>>> thredds mailing list
>>>> thredds@xxxxxxxxxxxxxxxx
>>>> For list information or to unsubscribe,  visit: 
>>>> http://www.unidata.ucar.edu/mailing_lists/
>>>> 
>>>> _______________________________________________
>>>> thredds mailing list
>>>> thredds@xxxxxxxxxxxxxxxx
>>>> For list information or to unsubscribe,  visit: 
>>>> http://www.unidata.ucar.edu/mailing_lists/
>>> 
>>> 
>>> _______________________________________________
>>> thredds mailing list
>>> thredds@xxxxxxxxxxxxxxxx
>>> For list information or to unsubscribe,  visit: 
>>> http://www.unidata.ucar.edu/mailing_lists/
>> 
>> = = =
>> Nathan Potter                        ndp at opendap.org
>> OPeNDAP, Inc.                        +1.541.231.3317
>> 
> 
> = = =
> Nathan Potter                        ndp at opendap.org
> OPeNDAP, Inc.                        +1.541.231.3317
> 
> 
> _______________________________________________
> thredds mailing list
> thredds@xxxxxxxxxxxxxxxx
> For list information or to unsubscribe,  visit: 
> http://www.unidata.ucar.edu/mailing_lists/
References:
- [thredds] THREDDS Data Server serving from Amazon S3
  - From: David Nahodil
- Re: [thredds] THREDDS Data Server serving from Amazon S3
  - From: John Caron
- Re: [thredds] THREDDS Data Server serving from Amazon S3
  - From: Jeff McWhirter
- Re: [thredds] THREDDS Data Server serving from Amazon S3
  - From: Robert Casey
- Re: [thredds] THREDDS Data Server serving from Amazon S3
  - From: Jeff McWhirter
- Re: [thredds] THREDDS Data Server serving from Amazon S3
  - From: Nathan Potter
- Re: [thredds] THREDDS Data Server serving from Amazon S3
  - From: Nathan Potter
- Re: [thredds] THREDDS Data Server serving from Amazon S3
  - From: David Nahodil
2015 messages navigation, sorted by:
1. Thread
2. Subject
3. Author
4. Date
5. ↑ Table Of Contents
Search the thredds archives: