Reading large data

Data access must be fast

The best way to get started with this topic is to look at an example. Let's assume that you must read a large amount of data from a binary file and store it in an array for further processing.

Java I/O is based on streams that represent a sequence of bytes. First, you must choose a stream type. We are working with binary data, so the FileInputStream class is the correct choice.

You should consider using the FileReader class when working with character data streams.

We can open a connection to an actual file like this:
InputStream in = new FileInputStream(fileName);

The goal is an effective approach to reading large data, one that takes both time and memory allocation into account in order to improve overall system performance.

Keeping performance issues in mind

At this point, it is possible to read data from the file, but let's take a closer look at other classes from the java.io package. The BufferedInputStream class is a wrapper for input streams, allowing buffering of its input and improving the reading process. You can connect to a file like this:

InputStream is = new BufferedInputStream(new FileInputStream(fileName));
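
As a minimal sketch of the wrapping shown above, the following helper opens a file through a BufferedInputStream and closes it automatically. The class name BufferedOpen, the method firstByte, and the 64 KB buffer size are hypothetical choices made for illustration, and the try-with-resources syntax assumes Java 7 or later.

```java
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

final class BufferedOpen {
    // Illustrative buffer size (an assumption); if omitted,
    // BufferedInputStream uses its own default size.
    private static final int BUFFER_SIZE = 64 * 1024;

    // Returns the first byte of the file as a value in 0..255,
    // or -1 if the file is empty.
    static int firstByte(String fileName) throws IOException {
        // try-with-resources closes the wrapper and the underlying file stream.
        try (InputStream is =
                 new BufferedInputStream(new FileInputStream(fileName), BUFFER_SIZE)) {
            return is.read();
        }
    }
}
```

Closing the outer BufferedInputStream also closes the FileInputStream it wraps, so only one resource needs to be declared.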

Reading the file

When you've connected to the file, you can start reading from it. The InputStream class has two main methods for reading data:

  • int read()
  • int read(byte[] b, int off, int len)

The first method reads only one byte of data at a time, whereas the second one reads up to len bytes of data from the stream into the byte array b, starting at offset off. Both return -1 when the end of the stream is reached.
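
To make the difference between the two methods concrete, here is a small sketch that counts the bytes in a stream both ways. The class ReadMethods and its method names are hypothetical and exist only for this illustration.

```java
import java.io.IOException;
import java.io.InputStream;

final class ReadMethods {
    // One call per byte: read() returns the byte as 0..255, or -1 at end of stream.
    static long countOneByOne(InputStream in) throws IOException {
        long count = 0;
        while (in.read() != -1) {
            count++;
        }
        return count;
    }

    // One call per chunk: read(b, off, len) returns how many bytes were
    // actually stored in b (possibly fewer than len), or -1 at end of stream.
    static long countInChunks(InputStream in, int chunkSize) throws IOException {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        byte[] buf = new byte[chunkSize];
        long count = 0;
        int len;
        while ((len = in.read(buf, 0, buf.length)) != -1) {
            count += len;
        }
        return count;
    }
}
```

Note that the chunked version must use the returned length rather than assume the buffer was filled completely.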

Example 1

The second method is clearly the faster one for bulk reads, so we'll use it as presented in Listing A.

This listing has several interesting aspects.

  • First, because the file is big, we allocate a rather big buffer (20 MB) when calling the read method. A bigger buffer means fewer read calls, which generally makes reading all the data faster. It is sometimes possible to know in advance the number of bytes that can be read from an input stream without blocking, and to allocate a buffer of the same size. This is accomplished by calling the available method. Unfortunately, this method does not always return correct results and can throw an exception. This is the case while reading database data as a long or BLOB via a stream.
  • Second, all arrays are initialized outside of the while loop, meaning the out, buf, and tmp arrays are reused, so fewer objects have to be garbage-collected.
  • Third, when the buffer is filled with part of the data, it is copied into a growing array by calling the System.arraycopy method.
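
The caveat about the available method can be handled defensively. The sketch below treats the value returned by available as a hint only and falls back to a fixed size when the hint is missing or the call fails. The class BufferSizing, the method chooseBufferSize, and its fallback and max parameters are hypothetical names introduced for this example.

```java
import java.io.IOException;
import java.io.InputStream;

final class BufferSizing {
    // Picks a buffer size from in.available(), which is only an estimate of
    // the bytes readable without blocking. Falls back when the estimate is
    // zero or the call throws, and never exceeds max.
    static int chooseBufferSize(InputStream in, int fallback, int max) {
        int estimate;
        try {
            estimate = in.available();
        } catch (IOException e) {
            return fallback;
        }
        if (estimate <= 0) {
            return fallback;
        }
        return Math.min(estimate, max);
    }
}
```

Capping the result keeps a large file from forcing one huge allocation.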

Although this algorithm is quite efficient, every loop iteration creates a temporary array and performs two array copies.
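
The listing itself is not reproduced here, so the following is only a sketch of the growing-array approach as the text describes it, not the original Listing A. The class name GrowingArrayReader and the bufSize parameter are assumptions; the variable names out, buf, and tmp follow the description above.

```java
import java.io.IOException;
import java.io.InputStream;

final class GrowingArrayReader {
    // Reads the whole stream into one array, extending the result
    // after every successful read.
    static byte[] readAll(InputStream in, int bufSize) throws IOException {
        if (bufSize <= 0) {
            throw new IllegalArgumentException("bufSize must be positive");
        }
        byte[] out = new byte[0];        // data accumulated so far
        byte[] buf = new byte[bufSize];  // read buffer, reused on every pass
        byte[] tmp;                      // temporary array for the extended result
        int len;
        while ((len = in.read(buf, 0, buf.length)) != -1) {
            tmp = new byte[out.length + len];                // one allocation per pass
            System.arraycopy(out, 0, tmp, 0, out.length);    // copy 1: old data
            System.arraycopy(buf, 0, tmp, out.length, len);  // copy 2: new chunk
            out = tmp;
        }
        return out;
    }
}
```

Because the old data is copied again on every pass, the total copying grows with both the file size and the number of reads, which is why a small buffer hurts this approach.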

Example 2

You can reduce data copying and array allocation by modifying the while loop as shown in Listing B.

Here, instead of storing intermediate data in a big array and extending it every time data is retrieved, the data is maintained in a list, where each element contains only a piece of the data. When the end of the stream is reached, the data can be taken from the list and merged into a single array. This allows you to save one array allocation and one copy operation. If you don't immediately need all the data as a single array, you can return the list itself and thus save some more time and resources. Reading data using this algorithm can be significantly faster than using the first one (Listing A). The difference in speed depends on the size of the buffer array used by the read method.
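
Again, the original listing is not reproduced here, so what follows is a sketch of the list-based idea under stated assumptions rather than Listing B itself. The class ChunkListReader, its methods readChunks and merge, and the choice of ArrayList are all hypothetical.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class ChunkListReader {
    // Reads the stream into a list of chunks; previously read data is never re-copied.
    static List<byte[]> readChunks(InputStream in, int bufSize) throws IOException {
        if (bufSize <= 0) {
            throw new IllegalArgumentException("bufSize must be positive");
        }
        List<byte[]> chunks = new ArrayList<>();
        byte[] buf = new byte[bufSize];
        int len;
        while ((len = in.read(buf, 0, buf.length)) != -1) {
            if (len > 0) {
                // Store only the bytes actually read, so buf can be reused.
                chunks.add(Arrays.copyOf(buf, len));
            }
        }
        return chunks;
    }

    // Merges the chunks into a single array with one final allocation.
    // Assumes the total size fits in an int, as any Java array must.
    static byte[] merge(List<byte[]> chunks) {
        int total = 0;
        for (byte[] chunk : chunks) {
            total += chunk.length;
        }
        byte[] out = new byte[total];
        int pos = 0;
        for (byte[] chunk : chunks) {
            System.arraycopy(chunk, 0, out, pos, chunk.length);
            pos += chunk.length;
        }
        return out;
    }
}
```

A caller that can process the data piece by piece can use the result of readChunks directly and skip merge altogether.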