Hello,
A nice feature of the java netcdf library is that it allows us to use a netcdf
resource in the same way regardless of locality. Our code can use the same
library calls to open a netcdf resource whether it is on a local filesystem or
on a web server.
The java netcdf library makes use of the HTTP Range header in the
HTTPRandomAccessFile class. This means the whole netcdf resource does not need
to be read or downloaded prior to use. It seems the netcdf library handles the
details quite well, requesting byte ranges similar to the way it would if the
resource were on a local filesystem.
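To illustrate the mechanism, here is a minimal sketch of a single byte-range
request using only the JDK. It is not the library's HTTPRandomAccessFile code;
the class and method names (RangeRequestSketch, rangeHeader, readRange) are
hypothetical, and it assumes Java 11 or later and a server that honors Range.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class RangeRequestSketch {

    /** Builds a Range header value for `length` bytes starting at `offset` (the end is inclusive). */
    static String rangeHeader(long offset, int length) {
        return "bytes=" + offset + "-" + (offset + length - 1);
    }

    /** Fetches one byte range; a server that honors Range answers 206 Partial Content. */
    static byte[] readRange(URL resource, long offset, int length) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) resource.openConnection();
        conn.setRequestProperty("Range", rangeHeader(offset, length));
        try {
            if (conn.getResponseCode() != HttpURLConnection.HTTP_PARTIAL) {
                throw new IOException("Server did not honor the Range header");
            }
            try (InputStream in = conn.getInputStream()) {
                // Reads up to `length` bytes, fewer only if the stream ends early.
                return in.readNBytes(length);
            }
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) {
        // No network access here: only show the header for the first 20,000 bytes.
        System.out.println(rangeHeader(0, 20 * 1000)); // prints "bytes=0-19999"
    }
}
```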
One downside of the approach is the amount of heap memory allocated by default
for each netcdf resource, especially in the case of forecasts spanning multiple
netcdf resources greater than ten million bytes each.
When attempting to open a single ensemble forecast composed of multiple netcdf
resources (in this case seven members times sixty-eight timesteps) prior to
reading values from them, an OutOfMemoryError is encountered with the following
stack trace:
java.lang.OutOfMemoryError: Java heap space
    at ucar.unidata.io.RandomAccessFile.init(RandomAccessFile.java:376)
    at ucar.unidata.io.RandomAccessFile.setBufferSize(RandomAccessFile.java:387)
    at ucar.unidata.io.http.HTTPRandomAccessFile.<init>(HTTPRandomAccessFile.java:98)
    at ucar.unidata.io.http.HTTPRandomAccessFile.<init>(HTTPRandomAccessFile.java:40)
    at ucar.nc2.NetcdfFile.getRaf(NetcdfFile.java:615)
    at ucar.nc2.NetcdfFile.open(NetcdfFile.java:506)
    at ucar.nc2.NetcdfFile.open(NetcdfFile.java:473)
    at ucar.nc2.NetcdfFile.open(NetcdfFile.java:458)
    at ucar.nc2.NetcdfFile.open(NetcdfFile.java:446)
Two possibilities come to mind as workarounds. First, allocate a larger heap.
How much larger? Perhaps ten million bytes times seven times sixty-eight, which
is 4.76 billion bytes, or around 4.4GiB more. But it does not seem right to
require an additional 4.4GiB of heap simply to open several resources, and a
user on a 32-bit system could not allocate that much at all. Second, perhaps we
could find a way to progressively open, read, and close each resource. This
might be possible, but seems clunky and incorrect. We should be able to open
all the resources, then access what is needed across them, then close them. In
this case, a forecast spans multiple resources and the goal is to read a single
forecast.
A third option is to consider code in HTTPRandomAccessFile near the
OutOfMemoryError and the variables involved.
public class HTTPRandomAccessFile extends ucar.unidata.io.RandomAccessFile {
  public static final int defaultHTTPBufferSize = 20 * 1000; // 20K
  public static final int maxHTTPBufferSize = 10 * 1000 * 1000; // 10 M
  ...
    if (total_length > 0) {
      // this means that we will read the file in one gulp then deal with it in memory
      int useBuffer = (int) Math.min(total_length, maxHTTPBufferSize); // entire file size if possible
      useBuffer = Math.max(useBuffer, defaultHTTPBufferSize); // minimum buffer
      setBufferSize(useBuffer);
    }
The Math.min and Math.max calls appear to clamp the buffer size between twenty
thousand and ten million bytes, so a ten million byte buffer is allocated for
each netcdf resource of ten million bytes or more.
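To make the clamping concrete, here is a small standalone sketch that mirrors
the two lines from the excerpt above and totals the buffers for this forecast.
The class name BufferSizing, the method effectiveBufferSize, and the sample
file sizes are made up for illustration; only the two constants and the
min/max logic come from the excerpt.

```java
public class BufferSizing {
    // Same values as defaultHTTPBufferSize and maxHTTPBufferSize in the excerpt.
    static final int DEFAULT_HTTP_BUFFER_SIZE = 20 * 1000;
    static final int MAX_HTTP_BUFFER_SIZE = 10 * 1000 * 1000;

    /** Mirrors the Math.min / Math.max clamp applied when total_length > 0. */
    static int effectiveBufferSize(long totalLength) {
        int useBuffer = (int) Math.min(totalLength, MAX_HTTP_BUFFER_SIZE);
        return Math.max(useBuffer, DEFAULT_HTTP_BUFFER_SIZE);
    }

    public static void main(String[] args) {
        // Hypothetical resource sizes in bytes.
        long[] sizes = {5_000L, 3_000_000L, 10_000_000L, 50_000_000L};
        for (long size : sizes) {
            System.out.println(size + " byte resource -> "
                    + effectiveBufferSize(size) + " byte buffer");
        }

        // Seven members times sixty-eight timesteps, each resource over ten million bytes.
        long totalBytes = 7L * 68L * effectiveBufferSize(50_000_000L);
        System.out.printf("total buffers: %d bytes (%.2f GiB)%n",
                totalBytes, totalBytes / (1024.0 * 1024.0 * 1024.0));
    }
}
```

With every resource above ten million bytes, the total comes to 4,760,000,000
bytes of buffer before a single value is read.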
Experimentation shows that more requests are made when maxHTTPBufferSize is
reduced, but the OutOfMemoryError is avoided.
The version control history shows that the code used to use only the twenty
thousand value, not ten million.
Is there any significance to the ten million byte buffer size?
Would you be willing to make a new default of two hundred thousand or perhaps
offer a Java System Property option to configure the value at JVM launch time?
For example, -Ducar.unidata.io.http.maxHTTPBufferSize=200000. It is preferable
to use a build tool to fetch the ucar-built and ucar-tested cdm artifact to get
the latest and greatest updates rather than maintain a fork.
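As a sketch of what the system property option might look like, the following
reads the property at startup and falls back to the current ten million byte
value. This is only a suggestion: the property name is the hypothetical one
from the example above, not an existing option, and the class
ConfigurableBufferLimit is invented for illustration.

```java
public class ConfigurableBufferLimit {
    // Hypothetical property name taken from the example in this message.
    static final String PROPERTY = "ucar.unidata.io.http.maxHTTPBufferSize";
    // Current hard-coded value of maxHTTPBufferSize.
    static final int BUILT_IN_MAX = 10 * 1000 * 1000;

    /** Returns the configured limit, or the built-in value if unset, unparsable, or not positive. */
    static int maxHttpBufferSize() {
        // Integer.getInteger returns null when the property is missing or not a number.
        Integer configured = Integer.getInteger(PROPERTY);
        return (configured != null && configured > 0) ? configured : BUILT_IN_MAX;
    }

    public static void main(String[] args) {
        // Run with -Ducar.unidata.io.http.maxHTTPBufferSize=200000 to override.
        System.out.println("maxHTTPBufferSize = " + maxHttpBufferSize());
    }
}
```

Falling back to the built-in value when the property is absent would leave
behavior unchanged for everyone who does not set it.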
I have not experimented with the setting with any datasets other than this
narrow use case, so I also wonder about the impact on other uses. All I can see
from experimentation is that a trade-off is made between request/response
overhead on the one hand (higher when set lower) and data volume on the other
hand (higher when set higher).
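A rough model of that trade-off, under the simplifying assumption that each
buffer fill is one Range request of at most the buffer size. It is not
measured library behavior; the class BufferTradeOff, its methods, and the
fifty million byte resource size are hypothetical.

```java
public class BufferTradeOff {

    /** Requests needed to read a whole resource front to back, one buffer fill per request. */
    static long requestsForFullScan(long resourceSize, int bufferSize) {
        return (resourceSize + bufferSize - 1) / bufferSize; // ceiling division
    }

    /** Bytes transferred to satisfy one small isolated read, since a full buffer is fetched. */
    static long bytesPerSmallRead(long resourceSize, int bufferSize) {
        return Math.min(resourceSize, bufferSize);
    }

    public static void main(String[] args) {
        long resourceSize = 50_000_000L; // hypothetical resource size in bytes
        int[] bufferSizes = {20_000, 200_000, 10_000_000};
        for (int bufferSize : bufferSizes) {
            System.out.println("buffer " + bufferSize
                    + ": " + requestsForFullScan(resourceSize, bufferSize) + " requests for a full scan, "
                    + bytesPerSmallRead(resourceSize, bufferSize) + " bytes per small read");
        }
    }
}
```

Under this model a two hundred thousand byte buffer needs fifty times as many
requests as the ten million byte buffer for a full scan, but fetches fifty
times less data for a read that only needs a few values.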
The trace above is with cdm-5.1.0.jar, from the ucar artifact repository
(fetched with gradle), with sha256sum
d211d2b040aa1d63bc3a6898bb27f55fb116f743dee4572b7f1228e9d4cf37f1.
Thank you for your consideration,
Jesse Bickel
Contractor, ERT, Inc.
Federal Affiliation: NWC/OWP/NOAA/DOC