
[netCDF #JZW-805384]: Poor I/O performance on large blocksize filesystems



Gary,

The slowdown occurs when all these conditions are met:

  1.  You're dealing with netCDF classic or 64-bit offset format
      files, not netCDF-4 or netCDF-4 classic model files.

  2.  You have an unlimited dimension and many record variables that
      use it.

  3.  The file system has a large block size, the atomic size for
      disk access.

In this case, processing a variable at a time instead of a record at
a time can be very slow, because accessing all the data in a variable
(or some part of each record for a variable) typically reads each
record's disk blocks multiple times, once for each record variable you
access.  That's because a disk block is larger than a single
variable's slab of data within a record, so reading the nth record's
data for one variable pulls in much more data than is needed.

Consider a case that's not too atypical: a block size of 2 MiBytes,
365 records, and 100 float record variables, each dimensioned
(time=365, lat=73, lon=145) where time is the record dimension.  A
record's worth of data for each variable is only 73*145*4 = 42340
bytes, and each variable has 365 records.  So reading one whole
variable, 365*73*145*4 bytes or about 15.5 Mbytes, actually reads 365
disk blocks, which is 365*2097152 bytes or about 765 Mbytes.  That's
about 50 times more
bytes read than needed.  If you operate on every variable in the file,
one at a time, the result is 50 times more I/O than necessary, which
explains why it might be 50 times slower than it would be if you used
fixed size variables, stored contiguously, rather than record
variables, stored in pieces scattered throughout the records of the
file.
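If it helps to check the arithmetic, here is a minimal C sketch that
recomputes the amplification factor from the hypothetical sizes above
(nothing here is measured; the numbers are just the ones in the
example):

    #include <stdio.h>

    int main(void)
    {
        const double block = 2.0 * 1024 * 1024; /* 2 MiB filesystem block */
        const double slab  = 73.0 * 145 * 4;    /* one record of one float variable */
        const double nrec  = 365;

        double needed = nrec * slab;   /* bytes the application actually wants */
        double read   = nrec * block;  /* bytes read: at least one block per record */

        printf("needed %.1f MB, read %.1f MB, amplification %.0fx\n",
               needed / 1e6, read / 1e6, read / needed);
        return 0;
    }

Compiled and run, it prints roughly "needed 15.5 MB, read 765.5 MB,
amplification 50x".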

How can you deal with this to get efficient processing of such files?
Here are some workarounds and solutions:

1.  Don't use the unlimited dimension if you don't really need it.

2.  Make sure the record size of each variable is at least as big as
    the disk block size.

3.  Convert your record-oriented file to a file with only fixed size
    dimensions before using it in processing.  There's an nco operator
    for this, or you can use "nccopy -u infile outfile" to make the
    unlimited dimension a fixed size.

4.  Change the processing algorithms to read input a record at a time
    instead of a variable at a time, processing all the record
    variables after each record has been read.  (A sketch of this
    access pattern follows this list.)

5.  Use netCDF-4 classic model files or regular netCDF-4 files.  With
    the netCDF-4/HDF5 format, data is accessed in disk blocks, if
    stored contiguously, or by chunks for chunked data.  A chunk only
    contains data from a single variable.  Making chunks larger than
    disk blocks ensures that I/O will be efficient.  If data is
    compressed, each chunk is compressed separately, so if compressed
    chunks are much smaller than the disk block size, inefficiencies
    may still occur.  (A sketch of setting such chunk sizes also
    follows this list.)
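For approach 4, the record-at-a-time pattern looks roughly like the
following sketch, which uses only standard netCDF C API calls.  The
input file name is hypothetical, all record variables are assumed to
be floats, and error handling is collapsed into a single check macro:

    #include <stdio.h>
    #include <stdlib.h>
    #include <netcdf.h>

    #define CHECK(e) do { int s = (e); if (s != NC_NOERR) { \
        fprintf(stderr, "netCDF error: %s\n", nc_strerror(s)); exit(1); } } while (0)

    int main(void)
    {
        int ncid, ndims, nvars, ngatts, unlimdimid;
        size_t nrec;

        CHECK(nc_open("infile.nc", NC_NOWRITE, &ncid));  /* hypothetical input file */
        CHECK(nc_inq(ncid, &ndims, &nvars, &ngatts, &unlimdimid));
        CHECK(nc_inq_dimlen(ncid, unlimdimid, &nrec));

        /* Outer loop over records, inner loop over variables, so each
           record's disk blocks are read only once. */
        for (size_t rec = 0; rec < nrec; rec++) {
            for (int varid = 0; varid < nvars; varid++) {
                int vndims, dimids[NC_MAX_VAR_DIMS];
                nc_type xtype;
                CHECK(nc_inq_var(ncid, varid, NULL, &xtype, &vndims, dimids, NULL));

                /* Skip fixed-size variables; only record variables matter here. */
                if (vndims == 0 || dimids[0] != unlimdimid)
                    continue;

                /* Read just this variable's slab of record 'rec'. */
                size_t start[NC_MAX_VAR_DIMS], count[NC_MAX_VAR_DIMS], slab = 1;
                start[0] = rec;
                count[0] = 1;
                for (int d = 1; d < vndims; d++) {
                    size_t len;
                    CHECK(nc_inq_dimlen(ncid, dimids[d], &len));
                    start[d] = 0;
                    count[d] = len;
                    slab *= len;
                }

                float *buf = malloc(slab * sizeof(float)); /* assumes float data */
                if (buf == NULL) exit(1);
                CHECK(nc_get_vara_float(ncid, varid, start, count, buf));
                /* ... process record 'rec' of this variable here ... */
                free(buf);
            }
        }
        CHECK(nc_close(ncid));
        return 0;
    }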
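For approach 5, chunk sizes can be set explicitly when a netCDF-4
variable is defined.  The sketch below mirrors the example dimensions
above and picks chunks of 50 records, so that each chunk
(50*73*145*4 bytes, about 2.1 Mbytes) is at least one 2 MiB disk
block; the file and variable names are made up for illustration:

    #include <stdio.h>
    #include <stdlib.h>
    #include <netcdf.h>

    #define CHECK(e) do { int s = (e); if (s != NC_NOERR) { \
        fprintf(stderr, "netCDF error: %s\n", nc_strerror(s)); exit(1); } } while (0)

    int main(void)
    {
        int ncid, time_dim, lat_dim, lon_dim, varid;
        int dimids[3];
        size_t chunks[3] = {50, 73, 145};   /* ~2.1 MB per chunk; one reasonable choice */

        CHECK(nc_create("chunked.nc", NC_NETCDF4, &ncid)); /* hypothetical output file */
        CHECK(nc_def_dim(ncid, "time", NC_UNLIMITED, &time_dim));
        CHECK(nc_def_dim(ncid, "lat", 73, &lat_dim));
        CHECK(nc_def_dim(ncid, "lon", 145, &lon_dim));

        dimids[0] = time_dim;
        dimids[1] = lat_dim;
        dimids[2] = lon_dim;
        CHECK(nc_def_var(ncid, "T", NC_FLOAT, 3, dimids, &varid));
        CHECK(nc_def_var_chunking(ncid, varid, NC_CHUNKED, chunks));
        CHECK(nc_enddef(ncid));

        /* ... write data with nc_put_vara_float() as usual ... */
        CHECK(nc_close(ncid));
        return 0;
    }

Recent versions of nccopy can also set chunk sizes from the command
line, which avoids writing any code.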

I will be using approach number 4 in nccopy to detect and deal with
this situation on systems with large block size.

--Russ

> Below are the two key posts of a discussion amongst Gary Strand, CISL,
> and myself about NCO and netCDF performance issues on large blocksize
> filesystems. Full thread at
> https://sourceforge.net/projects/nco/forums/forum/9829/topic/4898620
> Would appreciate any insight from Unidata on the problem.
> Charlie
> ************************************************************************
> From Gary Strand 20111222:
> 
> (Preface: I've been a very happy user of nco for many years)
> 
> Technical details:
> NCO netCDF Operators version "4.0.8" last modified 2011/04/26 built Oct
> 18 2011
> on mirage4 by jam
> ncks version 4.0.8
> Linked to netCDF library version 4.1.3, compiled Jul 26 2011 15:05:13
> Copyright (C) 1995--2011 Charlie Zender
> 
> Problem: This issue may be related to the NOFILL issue with netCDF 4.1.2; in
> any case, on filesystems with large blocksizes (2M, for example,
> 'lustre' and
> NCAR's GLADE system) the I/O performance of even simple 'ncks' operations is
> horrible - time-to-completion ratios (compared to smaller blocksize
> filesystems)
> of 300:1 or even 1500:1 are not uncommon.
> 
> Investigation with NCAR CISL staff showed that a simple variable extraction
> that takes about 20 seconds on a small blocksize filesystem takes about
> 40 minutes
> on the GLADE filesystem (120:1 ratio) and that the following was found:
> 
> 12/20/11 3:57 PM JAM
> I should add that the actual performance for the first 39 minute test
> was around
> 30MB/sec for reads and 12MB/sec for writes.  So nco may be doing
> something else
> inefficiently in addition to reading/writing extra data.
> 
> 12/20/11 3:52 PM JAM
> Hi, we've done some testing since first getting this ticket and have
> found that
> the performance of ncks on filesystems with large block sizes (most of Glade
> is at a 2MB block size) is VERY bad and it seems to be reading/writing much
> more data than necessary.
> 
> The test we used was: "ncks -x -v TH
> b40.1850.track1.1deg.006a.cam2.h1.1012-01-01-00000.nc out.nc"
> 
> - The input file is 3.3GB and the output file is 1.1GB.  On an idle
> system (storm4)
> this command took around 39 minutes to complete when either input or output
> file is on a Glade filesystem.  During this time 60GB was read from
> glade and
> 26GB was written.
> 
> - Adding the -4 option to ncks resulted in the test taking 10 minutes to
> complete
> with 30GB read and 1.2GB written.
> 
> - We also ran the same tests on a Lustre filesystem with a 1MB block
> size and
> saw similar bad performance.
> 
> - Finally, running the same test with both input/output files in /tmp (local
> drive, 4k block size) finished in 17 seconds.
> 
> I don't know exactly what ncks does or how it does it, but there seems to be
> an issue with large-block filesystems possibly causing it to read and write
> overlapping blocks of data, resulting in the very large numbers of extra
> bytes
> read/written listed above.  Large-block filesystems also caused the
> silent data
> corruption issue with nco a few months back, which could be related.
> 
> With this information and your plots, the effect of system load and
> number of
> users is not as significant as we originally thought, and the bad
> performance
> on glade is most likely related to the actual amount of data being
> transferred
> (86GB in the worst case).
> 
> Is this an NCO problem or a netCDF-4 problem?
> ************************************************************************
> From Charlie Zender 20120103:
> 
> Hi Gary,
> 
> I have reproduced the problem you are experiencing using NCO on the
> large block filesystem (LBF) named GLADE. The binaries in
> ~zender/bin/[LINUXAMD64,AIX] improve the performance by about a factor
> of two relative to NCO 4.0.8, but something lower in the software
> stack than NCO, e.g., the netCDF library or the filesystem itself,
> seems to cause the gross degradation in performance relative to
> NCO on smaller block filesystems.
> 
> Without going into too much detail, and for the benefit and comment of
> others following this issue, my conclusions about the slow performance
> of NCO on LBFs (i.e., GLADE) on both AIX and Linux are:
> 
> 0. NCO (and ncks in particular) doesn't use any fancy algorithms.
> NCO uses only official, documented netCDF API calls to do its work.
> NCO does not pay attention to block-sizes. Unless hyperslabbing is
> requested, NCO transfers entire variables with _one call_ (rather than
> with continuous/consecutive calls) to nc_var_[get/put].
> 1. Slow performance on LBFs is experienced when any version of NCO is
> linked to any netCDF version including 4.1.3. I tested this with NCO
> 3.9.6 on AIX (using the bluefire default, i.e., /usr/local/bin/ncks
> which is linked to netCDF 3.6.2), and Gary or CISL tested this with
> NCO 4.0.8 on Linux (unsure what library they used).
> 2. NCO version 4.0.8 worsens the performance relative to other
> versions of NCO, but changes in 4.0.8 do not cause the underlying
> problem. 4.0.8 uses netCDF fill-mode to work around the netCDF 4.1.2
> (and all preceding versions) "NOFILL" bug. This causes 4.0.8 to write
> (at least) twice as much data as other versions of NCO.
> 3. NCO version 4.0.9, which is in beta and not yet released, improves
> the performance by about a factor of two relative to 4.0.8. This is
> consistent with the reversion of 4.0.9 to previous NCO behavior which
> utilizes the netCDF NOFILL feature to reduce writes by (at least) a
> factor of two. It is only safe to use NCO 4.0.9+ with netCDF
> 4.1.3+. Otherwise the netCDF NOFILL bug may be triggered.
> 4. NCO operations on LBFs are twice as fast on Linux as on AIX.
> Extracting large datasets to netCDF3 files rather than netCDF4 files
> takes ~2.5 times as long. These factors are independent, so the best
> performance on large block filesystems is obtained with NCO 4.0.9 (or
> any NCO except 4.0.8) under Linux writing netCDF4 files. The worst
> performance will be with NCO 4.0.8 under AIX writing netCDF3 files.
> 5. Improving NCO performance on LBFs may require more detailed
> performance analysis and algorithms for sub-setting. An obvious place
> to start is to use a blocksize-sensitive copy size. Recent versions of
> nccopy use such an algorithm, I believe. However, this would require a
> significant code refactoring for NCO, which is not currently funded.
> However, NASA may fund implementation of groups in NCO. More on that
> in coming weeks. Maybe those funds can leverage some of this work.
> 6. Having written this much I'd like to hear from others before
> blabbing-on. I wasn't aware there was any penalty for LBFs, so credit
> goes to Gary for reporting the dramatic slow-downs on GLADE.
> Any good ideas for methods to speed up netCDF3 writes on LBFs?
> Are these performance penalties for LBFs better understood by others?
> 
> Charlie
> 
> Output of selected commands (extraneous stuff deleted):
> 
> # Copying 3 GB takes ~1 minute with AIX on GLADE
> zender@be1005en:~$ time /bin/cp
> /glade/user/strandwg/NCO/b40.1850.track1.1deg.006a.cam2.h1.1012-01-01-00000.nc
> ~/gary.nc
> real    1m3.219s
> 
> # Copying 3 GB takes ~30 seconds with Linux on GLADE, twice as fast as AIX
> zender@mirage0:~$ time /bin/cp
> /glade/user/strandwg/NCO/b40.1850.track1.1deg.006a.cam2.h1.1012-01-01-00000.nc
> ~/gary.nc
> real    0m30.812s
> 
> # Test case takes ~8 minutes with ncks 3.9.6 on AIX
> zender@be1005en:~$ /usr/local/bin/ncks --lbr
> Linked to netCDF library version "3.6.2", compiled Apr  3 2007 14:19:36
> zender@be1005en:~$ /usr/local/bin/ncks --vrs
> NCO netCDF Operators version "3.9.6" last modified 2009/01/21 built Jan
> 28 2009 on be1105en by ddvento
> zender@be1005en:~$ time /usr/local/bin/ncks -O -D 3 -x -v TH ~/gary.nc
> ~/out3_blf_3.9.6.nc
> real    8m9.658s
> 
> # Test case takes ~8 minutes with ncks 4.0.9 on AIX
> zender@be1005en:~$ /glade/home/zender/bin/AIX/ncks --lbr
> Linked to netCDF library version 4.1.3, compiled Aug 25 2011 08:32:40
> zender@be1005en:~$ /glade/home/zender/bin/AIX/ncks --vrs
> NCO netCDF Operators version 20120103 built Jan  3 2012 on
> be1005en.ucar.edu by zender
> zender@be1005en:~$ time /glade/home/zender/bin/AIX/ncks -O -D 3 -x -v TH
> ~/gary.nc ~/out3_blf_4.0.9.nc
> real    7m48.197s
> 
> # netCDF4 improves speed relative to netCDF3 by factor of ~2.5 on AIX
> zender@be1005en:~$ time /glade/home/zender/bin/AIX/ncks -O -4 -D 3 -x -v
> TH ~/gary.nc ~/out4_blf_4.0.9.nc
> real    2m42.123s
> 
> # Test case takes ~4 minutes with ncks 4.0.9 on Linux
> zender@mirage0:~$ /glade/home/zender/bin/LINUXAMD64/ncks/ncks --lbr
> Linked to netCDF library version 4.1.3, compiled Jul 26 2011 15:05:13
> zender@mirage0:~$ /glade/home/zender/bin/LINUXAMD64/ncks/ncks --vrs
> NCO netCDF Operators version 20120103 built Jan 3 2012 on mirage0 by zender
> zender@mirage0:~$ time /glade/home/zender/bin/LINUXAMD64/ncks -O -D 3 -x
> -v TH ~/gary.nc ~/out3_mrg_4.0.9.nc
> real 4m15.493s
> 
> # netCDF4 improves speed relative to netCDF3 by factor of ~2.5 on Linux
> zender@mirage0:~/nco$ time /glade/home/zender/bin/LINUXAMD64/ncks -O -4
> -D 3 -x -v TH ~/gary.nc ~/out4_mrg_4.0.9.nc
> real 1m44.345s
> --
> Charlie Zender, Department of Earth System Science
> University of California, Irvine 949-891-2429  )'(
> 
> 

Russ Rew                                         UCAR Unidata Program
address@hidden                      http://www.unidata.ucar.edu



Ticket Details
===================
Ticket ID: JZW-805384
Department: Support netCDF
Priority: Critical
Status: Closed