Showing entries tagged [chunking]

Chunking Algorithms for NetCDF-C

Unidata is in the process of developing a Zarr [] based variant of netcdf. As part of this effort, it was necessary to implement some support for chunking. Specifically, the problem to be solved was that of extracting a hyperslab of data from an n-dimensional variable (array in Zarr parlance) that has been divided into chunks (in the HDF5 sense). Each chunk is stored independently in the data storage -- Amazon S3, for example.

The algorithm takes a series of R slices of the form (first,stop,stride), where R is the rank of the variable. Note that a slice of the form (first, count, stride), as used by netcdf, is equivalent because stop = first + count*stride. These slices form a hyperslab.

The goal is to compute the set of chunks that intersect the hyperslab and to then extract the relevant data from that set of chunks to produce the hyperslab.

[Read More]

NetCDF Compression

The steady state of disks is full. --Ken Thompson

Introduction

From our support questions, it appears that the major feature of netCDF-4 attracting users to upgrade their libraries from netCDF-3 is compression. The netCDF-4 libraries inherit the capability for data compression from the HDF5 storage layer underneath the netCDF-4 interface. Linking a program that uses netCDF to a netCDF-4 library allows the program to read compressed data without changing a single line of the program source code. Writing netCDF compressed data only requires a few extra statements. And the nccopy utility program supports converting classic netCDF format data to or from compressed data without any programming.

[Read More]

Chunking Data: Choosing Shapes

In part 1, we explained what data chunking is about in the context of scientific data access libraries such as netCDF-4 and HDF5, presented a 38 GB 3-dimensional dataset as a motivating example, discussed benefits of chunking, and showed with some benchmarks what a huge difference chunk shapes can make in balancing read times for data that will be accessed in multiple ways.

In this post, I'll continue looking at that example dataset to see how we can derive good chunk shapes, generalize to other datasets, look at how long it can take to rechunk a multidimensional dataset, and look at the use of Solid State Disk (SSD) for both accessing and rechunking data.

[Read More]

Chunking Data: Why it Matters

What is data chunking? How can chunking help to organize large multidimensional datasets for both fast and flexible data access?  How should chunk shapes and sizes be chosen?  Can software such as netCDF-4 or HDF5 provide better defaults for chunking? If you're interested in those questions and some of the issues they raise, read on ...

[Read More]

Netcdf-4 Chunking Performance Results on AR-4 3D Data File

Some results from AR-5 performance evaluation

As part of analyzing netcdf-4 performance for the upcoming AR-5 climate data archive, I have been running benchmarks on some AR-4 (3D precip flux) data that I got from Gary Strand (thanks Gary!) pr_A1.20C3M_8.CCSM.atmm.1870-01_cat_1999-12.nc.

Here's what's in the file:

 netcdf pr_A1.20C3M_8.CCSM.atmm.1870-01_cat_1999-12
 {                                                                                                          
  dimensions:                                                                                                                                                      
          lon = 256 ;                                                                                                                                              
          lat = 128 ;                                                                                                                                              
          bnds = 2 ;                                                                                                                                               
          time = UNLIMITED ; // (1560 currently)                                                                                                                   
  variables:                                                                                                                                                       
          double lon_bnds(lon, bnds) ;                                                                                                                             
          double lat_bnds(lat, bnds) ;                                                                                                                             
          double time_bnds(time, bnds) ;                                                                                                                           
          double time(time) ;                                                                                                                                      
                  time:calendar = "noleap" ;                                                                                                                       
                  time:standard_name = "time" ;                                                                                                                    
                  time:axis = "T" ;                                                                                                                                
                  time:units = "days since 0000-1-1" ;                                                                                                             
                  time:bounds = "time_bnds" ;                                                                                                                      
                  time:long_name = "time" ;                                                                                                                        
          double lat(lat) ;                                                                                                                                        
                  lat:axis = "Y" ;                                                                                                                                 
                  lat:standard_name = "latitude" ;                                                                                                                 
                  lat:bounds = "lat_bnds" ;                                                                                                                        
                  lat:long_name = "latitude" ;                                                                                                                     
                  lat:units = "degrees_north" ;                                                                                                                    
          double lon(lon) ;                                                                                                                                        
                  lon:axis = "X" ;                                                                                                                                 
                  lon:standard_name = "longitude" ;                                                                                                                
                  lon:bounds = "lon_bnds" ;                                                                                                                        
                  lon:long_name = "longitude" ;                                                                                                                    
                  lon:units = "degrees_east" ;                                                                                                                     
          float pr(time, lat, lon) ;                                                                                                                               
                  pr:comment = "Created using NCL code CCSM_atmm_2cf.ncl on\n",                                                                                    
                          " machine mineral" ;                                                                                                                     
                  pr:missing_value = 1.e+20f ;                                                                                                                     
                  pr:_FillValue = 1.e+20f ;                                                                                                                        
                  pr:cell_methods = "time: mean (interval: 1 month)" ;                                                                                             
                  pr:history = "(PRECC+PRECL)*r[h2o]" ;                                                                                                            
                  pr:original_units = "m-1 s-1" ;                                                                                                                  
                  pr:original_name = "PRECC, PRECL" ;                                                                                                              
                  pr:standard_name = "precipitation_flux" ;                                                                                                        
                  pr:units = "kg m-2 s-1" ;                                                                                                                        
                  pr:long_name = "precipitation_flux" ;                                                                                                            
                  pr:cell_method = "time: mean" ;          

And here are the first results of putting this data in different sets of chunksizes, with no compression. The first I read all horizontal slabs in the file, then 5 time series. The times show the time to read each slab, and the time to read each time series, in microseconds.

cs[0]   cs[1]   cs[2]   cache(MB) deflate shuffle read_hor(us) read_time_ser(us)
0       0       0       0         0       0       240          3822
1       16      32      1         0       0       667          57087
1       16      128     1         0       0       245          23929
1       16      256     1         0       0       160          26913
1       64      32      1         0       0       277          22840
1       64      128     1         0       0       147          41359
1       64      256     1         0       0       110          47856
1       128     32      1         0       0       205          25052
1       128     128     1         0       0       123          47417
1       128     256     1         0       0       97           68877
10      16      32      1         0       0       552          3284
10      16      128     1         0       0       204          5834
10      16      256     1         0       0       138          8465
10      64      32      1         0       0       233          5268
10      64      128     1         0       0       132          16690
10      64      256     1         0       0       99           28037
10      128     32      1         0       0       180          8414
10      128     128     1         0       0       113          28064
10      128     256     1         0       0       90           54715
256     16      32      1         0       0       8853         1167
256     16      128     1         0       0       8012         3677
256     16      256     1         0       0       118          1581
256     64      32      1         0       0       8170         3737
256     64      128     1         0       0       227          1640
256     64      256     1         0       0       80           1627
256     128     32      1         0       0       645          1624
256     128     128     1         0       0       211          1650
256     128     256     1         0       0       68           1667
1024    16      32      1         0       0       32337        1192
1024    16      128     1         0       0       296          1489
1024    16      256     1         0       0       114          1564
1024    64      32      1         0       0       679          1415
1024    64      128     1         0       0       221          1503
1024    64      256     1         0       0       79           1669
1024    128     32      1         0       0       646          1558
1024    128     128     1         0       0       208          1568
1024    128     256     1         0       0       68           1646
1560    16      32      1         0       0       55064        1055
1560    16      128     1         0       0       298          1438
1560    16      256     1         0       0       115          1477
1560    64      32      1         0       0       685          1425
1560    64      128     1         0       0       225          1545
1560    64      256     1         0       0       79           1589
1560    128     32      1         0       0       658          1535
1560    128     128     1         0       0       208          1567
1560    128     256     1         0       0       68           1544

The first line shows the read times for the classic netcdf file.

I am happy to see there are a number of cases that clearly outperform classic netcdf. The trick is to come up with some algorithm that comes up with the correct answers without the user being involved.

Unidata Developer's Blog
A weblog about software development by Unidata developers*
Unidata Developer's Blog
A weblog about software development by Unidata developers*

Welcome

FAQs

News@Unidata blog

Take a poll!

What if we had an ongoing user poll in here?

Browse By Topic
Browse by Topic
« January 2025
SunMonTueWedThuFriSat
   
1
2
3
4
5
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
 
       
Today