Iterating through a massive file

Functions such as file_get_contents() and file() are quick and easy to use however, owing to memory limitations, they quickly cause problems when dealing with massive files. The default setting for the php.ini memory_limit setting is 128 megabytes. Accordingly, any file larger than this will not be loaded.

Another consideration when parsing through massive files is how quickly does your function or class method produce output? When producing user output, for example, although it might at first glance seem better to accumulate output in an array. You would then output it all at once for improved efficiency. Unfortunately, this might have an adverse impact on the user experience. It might be better to create a generator, and use the yield keyword to produce immediate results.

How to do it...

As mentioned before, the file* functions (that is, file_get_contents()), are not suitable for large files. The simple reason is that these functions, at one point, have the entire contents of the file represented in memory. Accordingly, the focus of this recipe will be on the f* functions (that is, fopen()).

In a slight twist, however, instead of using the f* functions directly, instead we will use the SplFileObject class, which is included in the SPL (Standard PHP Library):

First, we define a Application\Iterator\LargeFile class with the appropriate properties and constants:

namespace Application\Iterator;

use Exception;
use InvalidArgumentException;
use SplFileObject;
use NoRewindIterator;

class LargeFile
{
  const ERROR_UNABLE = 'ERROR: Unable to open file';
  const ERROR_TYPE   = 'ERROR: Type must be "ByLength", "ByLine" or "Csv"';     
  protected $file;
  protected $allowedTypes = ['ByLine', 'ByLength', 'Csv'];

We then define a __construct() method that accepts a filename as an argument and populates the $file property with an SplFileObject instance. This is also a good place to throw an exception if the file does not exist:

public function __construct($filename, $mode = 'r')
{
  if (!file_exists($filename)) {
    $message = __METHOD__ . ' : ' . self::ERROR_UNABLE . PHP_EOL;
    $message .= strip_tags($filename) . PHP_EOL;
    throw new Exception($message);
  }
  $this->file = new SplFileObject($filename, $mode);
}

Next we define a method fileIteratorByLine()method which uses fgets() to read one line of the file at a time. It's not a bad idea to create a complimentary fileIteratorByLength()method that does the same thing but uses fread() instead. The method that uses fgets() would be suitable for text files that include linefeeds. The other method could be used if parsing a large binary file:
```
protected function fileIteratorByLine()
{
  $count = 0;
  while (!$this->file->eof()) {
    yield $this->file->fgets();
    $count++;
  }
  return $count;
}
    
protected function fileIteratorByLength($numBytes = 1024)
{
  $count = 0;
  while (!$this->file->eof()) {
    yield $this->file->fread($numBytes);
    $count++;
  }
  return $count; 
}
```
Finally, we define a getIterator()method that returns a NoRewindIterator() instance. This method accepts as arguments either ByLine or ByLength, which refer to the two methods defined in the previous step. This method also needs to accept $numBytes in case ByLength is called. The reason we need a NoRewindIterator() instance is to enforce the fact that we're reading through the file only in one direction in this example:
```
public function getIterator($type = 'ByLine', $numBytes = NULL)
{
  if(!in_array($type, $this->allowedTypes)) {
    $message = __METHOD__ . ' : ' . self::ERROR_TYPE . PHP_EOL;
    throw new InvalidArgumentException($message);
  }
  $iterator = 'fileIterator' . $type;
  return new NoRewindIterator($this->$iterator($numBytes));
}
```

How it works...

First of all, we take advantage of the autoloading class defined in Chapter 1, Building a Foundation, to obtain an instance of Application\Iterator\LargeFile in a calling program, chap_02_iterating_through_a_massive_file.php:

define('MASSIVE_FILE', '/../data/files/war_and_peace.txt');
require __DIR__ . '/../Application/Autoload/Loader.php';
Application\Autoload\Loader::init(__DIR__ . '/..');

Next, inside a try {...} catch () {...} block, we get an instance of a ByLine iterator:

try {
  $largeFile = new Application\Iterator\LargeFile(__DIR__ . MASSIVE_FILE);
  $iterator = $largeFile->getIterator('ByLine');

We then provide an example of something useful to do, in this case, defining an average of words per line:

$words = 0;
foreach ($iterator as $line) {
  echo $line;
  $words += str_word_count($line);
}
echo str_repeat('-', 52) . PHP_EOL;
printf("%-40s : %8d\n", 'Total Words', $words);
printf("%-40s : %8d\n", 'Average Words Per Line', 
($words / $iterator->getReturn()));
echo str_repeat('-', 52) . PHP_EOL;

We then end the catch block:

} catch (Throwable $e) {
  echo $e->getMessage();
}

The expected output (too large to show here!) shows us that there are 566,095 words in the project Gutenberg version of War and Peace. Also, we find the average number of words per line is eight.

Previous Chapter

Improving performance using PHP 7 enhancements

Next Chapter

Uploading a spreadsheet into a database

Table of Contents for PHP 7: Real World Application Development

Iterating through a massive file

How to do it...

How it works...

Table of Contents for
PHP 7: Real World Application Development