
Re: [netcdf-java] "Too many open files" for NcML aggregation in Netcdf-Java 4.0




Jon Blower wrote:
> Hi John,
> 
> I downloaded the latest version (wasn't identified as 4.0.25 but was
> dated yesterday so I assume it's the right one).  

You can always check the META-INF/MANIFEST.MF file in the jar:

Manifest-Version: 1.0
Ant-Version: Apache Ant 1.6.5
Created-By: 10.0-b22 (Sun Microsystems Inc.)
Built-By: caron
Built-On: 2008-10-07 23:02:32
Implementation-Title: NetCDF-Java-Library
Implementation-Version: 4.0.25.20081007.2302
Implementation-Vendor: UCAR/Unidata
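
If it's handier to check from code, here is a minimal sketch that prints
the Implementation-Version attribute from the jar's manifest (the jar
filename below is just a placeholder; use whatever your copy is called):

  import java.util.jar.Attributes;
  import java.util.jar.JarFile;

  public class CheckVersion {
    public static void main(String[] args) throws Exception {
      // Adjust the path to wherever your netcdf-java jar lives.
      JarFile jar = new JarFile("netcdf-4.0.jar");
      Attributes attrs = jar.getManifest().getMainAttributes();
      System.out.println(attrs.getValue("Implementation-Version"));
      jar.close();
    }
  }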

> Good news - I can see several cycles of RAF cleanup happening and we
> can load more datasets.  Bad news - I get a new error:
> 
> java.lang.NullPointerException
>       at ucar.nc2.ncml.AggregationOuterDimension$CacheVar.read(AggregationOuterDimension.java:814)
>       at ucar.nc2.ncml.AggregationOuterDimension$CoordValueVar.read(AggregationOuterDimension.java:900)
>       at ucar.nc2.ncml.AggregationOuterDimension$DatasetOuterDimension.cacheVariables(AggregationOuterDimension.java:679)
>       at ucar.nc2.ncml.Aggregation$Dataset.close(Aggregation.java:522)
>       at ucar.nc2.ncml.AggregationOuterDimension$DatasetOuterDimension.getNcoords(AggregationOuterDimension.java:597)
>       at ucar.nc2.ncml.AggregationOuterDimension$DatasetOuterDimension.setStartEnd(AggregationOuterDimension.java:613)
>       at ucar.nc2.ncml.AggregationOuterDimension.buildCoords(AggregationOuterDimension.java:143)
>       at ucar.nc2.ncml.AggregationExisting.buildDataset(AggregationExisting.java:71)
>       at ucar.nc2.ncml.Aggregation.finish(Aggregation.java:262)
>       at ucar.nc2.ncml.NcMLReader.readNetcdf(NcMLReader.java:424)
>       at ucar.nc2.ncml.NcMLReader.readNcML(NcMLReader.java:377)
>       at ucar.nc2.ncml.NcMLReader.readNcML(NcMLReader.java:203)
>       at ucar.nc2.ncml.NcMLReader.readNcML(NcMLReader.java:153)
>       at ucar.nc2.dataset.NetcdfDataset.acquireNcml(NetcdfDataset.java:624)
>       at ucar.nc2.dataset.NetcdfDataset.openOrAcquireFile(NetcdfDataset.java:572)
>       at ucar.nc2.dataset.NetcdfDataset.openDataset(NetcdfDataset.java:344)
>       at ucar.nc2.dataset.NetcdfDataset$MyNetcdfDatasetFactory.open(NetcdfDataset.java:445)
>       at ucar.nc2.dataset.NetcdfDataset$MyNetcdfDatasetFactory.open(NetcdfDataset.java:436)
>       at ucar.nc2.util.cache.FileCache.acquire(FileCache.java:178)
>       at ucar.nc2.dataset.NetcdfDataset.acquireNcml(NetcdfDataset.java:627)
>       at ucar.nc2.dataset.NetcdfDataset.openOrAcquireFile(NetcdfDataset.java:572)
>       at ucar.nc2.dataset.NetcdfDataset.acquireDataset(NetcdfDataset.java:433)
>       at ucar.nc2.dataset.NetcdfDataset.acquireDataset(NetcdfDataset.java:399)
> 
> The NcML file is the same as the one I sent before.  The NcML is the
> same for all datasets, so I don't know why it would fail on this one.
> I think the NPE is misleading.  I think this is a "too many files"
> error in disguise, because each subsequent call to anything that
> performs IO (e.g. dir.listFiles()) fails with an NPE.  Logs show 415
> open file handles before this dataset was opened, and the offending
> dataset has 564 files.

Yes, it was caused by another exception that was masked. I've fixed that problem.
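
For anyone curious what "masked" means, here is a generic sketch of the
pattern (not the actual Aggregation code; the names are made up): when a
handler swallows the real error, later code trips over the state it left
behind, and an NPE is all that surfaces.

  import java.io.IOException;

  public class MaskingDemo {

    static double[] data;  // filled by read() on success

    // Stand-in for a read that hits the OS file-handle limit.
    static void read() throws IOException {
      throw new IOException("Too many open files");
    }

    public static void main(String[] args) {
      try {
        read();
        data = new double[] {42.0};
      } catch (IOException e) {
        // Masked: swallowing the exception leaves data null...
      }
      // ...so the caller sees only a NullPointerException here, and
      // the real "too many open files" error never surfaces.
      System.out.println(data[0]);
    }
  }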

> 
> So perhaps we're going too fast for the cleanup thread, sending the
> number of file handles above the limit (which seems to be between 900
> and 1000 in my system).  I think a hard limit on the size of the file
> cache would be useful, and more reassuring for me, as I'd be confident
> that the error wasn't going to occur randomly.

I've added a hardLimit parameter, which you can set with:

  NetcdfDataset.initNetcdfFileCache(int minElementsInMemory, int softLimit,
      int hardLimit, int period);

It will do a synchronous cleanup when the hardLimit is reached. The
softLimit starts a cleanup in another thread 100 milliseconds later. I'm
surprised that's not fast enough; perhaps the threads in that OS behave
differently, or perhaps file opening is just very fast. I could allow
setting that delay time. Also, you may want to track down the ulimit
setting for that OS; I've always assumed it's important to have lots of
file handles for good performance.
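
For example, on a system limited to ~1000 file handles you might
configure something like this (the numbers are illustrative only, and
I'm assuming the period is in seconds):

  import ucar.nc2.dataset.NetcdfDataset;

  public class CacheSetup {
    public static void main(String[] args) {
      // Keep at least 100 datasets cached, start the delayed background
      // cleanup above 400 open files, force a synchronous cleanup at
      // 800 (safely below the ~1000-handle limit), and sweep every
      // 300 seconds.
      NetcdfDataset.initNetcdfFileCache(100, 400, 800, 300);
    }
  }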

I've only had time to do the most basic of unit tests on this, so please
let me know if you see anything suspicious.

Unfortunately, I can't make a release because of a minor screwup in our
system. Ethan might be able to do so later tonight, so I'm cc'ing him so
he can send you a message if he does. Otherwise, I'll do it first thing
in the morning.