Part of the testing and debugging process involves incorporating realistic test data. In some cases, especially when testing database access and producing benchmarks, large amounts of test data are needed. One way in which this can be accomplished is to incorporate a process of scraping data from websites, and then putting the data together in realistic, yet random, combinations to be inserted into a database.

\n", which gives us this:
prospects table. Here is the SQL statement used to create this table:CREATE TABLE 'prospects' (
'id' int(11) NOT NULL AUTO_INCREMENT,
'first_name' varchar(128) NOT NULL,
'last_name' varchar(128) NOT NULL,
'address' varchar(256) DEFAULT NULL,
'city' varchar(64) DEFAULT NULL,
'state_province' varchar(32) DEFAULT NULL,
'postal_code' char(16) NOT NULL,
'phone' varchar(16) NOT NULL,
'country' char(2) NOT NULL,
'email' varchar(250) NOT NULL,
'status' char(8) DEFAULT NULL,
'budget' decimal(10,2) DEFAULT NULL,
'last_updated' datetime DEFAULT NULL,
PRIMARY KEY ('id'),
UNIQUE KEY 'UNIQ_35730C06E7927C74' ('email')
) ENGINE=InnoDB DEFAULT CHARSET=utf8;id, which is auto-generated:namespace Application\Test;
use PDO;
use Exception;
use DateTime;
use DateInterval;
use PDOException;
use SplFileObject;
use InvalidArgumentsException;
use Application\Database\Connection;
class FakeData
{
// data generation methods here
}const MAX_LOOKUPS = 10; const SOURCE_FILE = 'file'; const SOURCE_TABLE = 'table'; const SOURCE_METHOD = 'method'; const SOURCE_CALLBACK = 'callback'; const FILE_TYPE_CSV = 'csv'; const FILE_TYPE_TXT = 'txt'; const ERROR_DB = 'ERROR: unable to read source table'; const ERROR_FILE = 'ERROR: file not found'; const ERROR_COUNT = 'ERROR: unable to ascertain count or ID column missing'; const ERROR_UPLOAD = 'ERROR: unable to upload file'; const ERROR_LOOKUP = 'ERROR: unable to find any IDs in the source table'; protected $connection; protected $mapping; protected $files; protected $tables;
protected $alpha = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'; protected $street1 = ['Amber','Blue','Bright','Broad','Burning', 'Cinder','Clear','Dewy','Dusty','Easy']; // etc. protected $street2 = ['Anchor','Apple','Autumn','Barn','Beacon', 'Bear','Berry','Blossom','Bluff','Cider','Cloud']; // etc. protected $street3 = ['Acres','Arbor','Avenue','Bank','Bend', 'Canyon','Circle','Street']; protected $email1 = ['northern','southern','eastern','western', 'fast','midland','central']; protected $email2 = ['telecom','telco','net','connect']; protected $email3 = ['com','net'];
Connection object, used for database access, an array of mappings to the fake data:public function __construct(Connection $conn, array $mapping)
{
$this->connection = $conn;
$this->mapping = $mapping;
}public function getAddress($entry)
{
return random_int(1,999)
. ' ' . $this->street1[array_rand($this->street1)]
. ' ' . $this->street2[array_rand($this->street2)]
. ' ' . $this->street3[array_rand($this->street3)];
}public function getPostalCode($entry, $pattern = 1)
{
return $this->alpha[random_int(0,25)]
. $this->alpha[random_int(0,25)]
. random_int(1, 99)
. ' '
. random_int(1, 9)
. $this->alpha[random_int(0,25)]
. $this->alpha[random_int(0,25)];
}$entry array, with parameters, and use those parameters to create the name portion of the address:public function getEmail($entry, $params = NULL)
{
$first = $entry[$params[0]] ?? $this->alpha[random_int(0,25)];
$last = $entry[$params[1]] ?? $this->alpha[random_int(0,25)];
return $first[0] . '.' . $last
. '@'
. $this->email1[array_rand($this->email1)]
. $this->email2[array_rand($this->email2)]
. '.'
. $this->email3[array_rand($this->email3)];
}$entry array, with parameters. The parameters would be an array where the first value is a start date. The second parameter would be the maximum number of days to subtract from the start date. This effectively lets you return a random date from a range. Note that we use DateTime::sub() to subtract a random number of days. sub() requires a DateInterval instance, which we build using P, the random number of days, and then 'D':public function getDate($entry, $params)
{
list($fromDate, $maxDays) = $params;
$date = new DateTime($fromDate);
$date->sub(new DateInterval('P' . random_int(0, $maxDays) . 'D'));
return $date->format('Y-m-d H:i:s');
}public function getEntryFromFile($name, $type)
{
if (empty($this->files[$name])) {
$this->pullFileData($name, $type);
}
return $this->files[$name][
random_int(0, count($this->files[$name]))];
}Exception if the specified file is not found. The file type is identified as one of our class constants: FILE_TYPE_TEXT or FILE_TYPE_CSV. Depending on the type, we use either fgetcsv() or fgets():public function pullFileData($name, $type)
{
if (!file_exists($name)) {
throw new Exception(self::ERROR_FILE);
}
$fileObj = new SplFileObject($name, 'r');
if ($type == self::FILE_TYPE_CSV) {
while ($data = $fileObj->fgetcsv()) {
$this->files[$name][] = trim($data);
}
} else {
while ($data = $fileObj->fgets()) {
$this->files[$name][] = trim($data);
}
}public function getEntryFromTable($tableName, $idColumn, $mapping)
{
$entry = array();
try {
if (empty($this->tables[$tableName])) {
$sql = 'SELECT ' . $idColumn . ' FROM ' . $tableName
. ' ORDER BY ' . $idColumn . ' ASC LIMIT 1';
$stmt = $this->connection->pdo->query($sql);
$this->tables[$tableName]['first'] =
$stmt->fetchColumn();
$sql = 'SELECT ' . $idColumn . ' FROM ' . $tableName
. ' ORDER BY ' . $idColumn . ' DESC LIMIT 1';
$stmt = $this->connection->pdo->query($sql);
$this->tables[$tableName]['last'] =
$stmt->fetchColumn();
}$result = FALSE; $count = self::MAX_LOOKUPS; $sql = 'SELECT * FROM ' . $tableName . ' WHERE ' . $idColumn . ' = ?'; $stmt = $this->connection->pdo->prepare($sql);
do...while loop. The reason for this is that we need to run the query at least once to achieve results. Only if we do not arrive at a result do we continue with the loop. We generate a random number between the lowest ID and the highest ID, and then use this in a parameter in the query. Notice that we also decrement a counter to prevent an endless loop. This is in case the IDs are not sequential, in which case we could accidentally generate an ID that does not exist. If we exceed the maximum attempts, still with no results, we throw an Exception:do {
$id = random_int($this->tables[$tableName]['first'],
$this->tables[$tableName]['last']);
$stmt->execute([$id]);
$result = $stmt->fetch(PDO::FETCH_ASSOC);
} while ($count-- && !$result);
if (!$result) {
error_log(__METHOD__ . ':' . self::ERROR_LOOKUP);
throw new Exception(self::ERROR_LOOKUP);
}
} catch (PDOException $e) {
error_log(__METHOD__ . ':' . $e->getMessage());
throw new Exception(self::ERROR_DB);
}foreach ($mapping as $key => $value) {
$entry[$value] = $result[$key] ?? NULL;
}
return $entry;
}getRandomEntry() method, which generates a single array of fake data. We loop through $mapping one entry at a time and examine the various parameters:public function getRandomEntry()
{
$entry = array();
foreach ($this->mapping as $key => $value) {
if (isset($value['source'])) {
switch ($value['source']) {source parameter is used to implement what effectively serves as a Strategy Pattern. We support four different possibilities for source, all defined as class constants. The first one is SOURCE_FILE. In this case, we use the getEntryFromFile() method discussed previously: case self::SOURCE_FILE :
$entry[$key] = $this->getEntryFromFile(
$value['name'], $value['type']);
break;$mapping array: case self::SOURCE_CALLBACK :
$entry[$key] = $value['name']();
break;SOURCE_TABLE option uses the database table defined in $mapping as a lookup. Note that getEntryFromTable(), discussed previously, is able to return an array of values, which means we need to use array_merge() to consolidate the results: case self::SOURCE_TABLE :
$result = $this->getEntryFromTable(
$value['name'],$value['idCol'],$value['mapping']);
$entry = array_merge($entry, $result);
break;SOURCE_METHOD option, which is also the default, uses a method already included with this class. We check to see whether parameters are included, and, if so, add those to the method call. Note the use of {} to influence interpolation. If we made a $this->$value['name']() PHP 7 call, due to the Abstract Syntax Tree (AST) rewrite, it would interpolate like this, ${$this->$value}['name'](), which is not what we want: case self::SOURCE_METHOD :
default :
if (!empty($value['params'])) {
$entry[$key] = $this->{$value['name']}(
$entry, $value['params']);
} else {
$entry[$key] = $this->{$value['name']}($entry);
}
}
}
}
return $entry;
}getRandomEntry() to produce multiple lines of fake data. We also add an option to insert to a destination table. If this option is enabled, we set up a prepared statement to insert, and also check to see whether we need to truncate any data currently in this table:public function generateData(
$howMany, $destTableName = NULL, $truncateDestTable = FALSE)
{
try {
if ($destTableName) {
$sql = 'INSERT INTO ' . $destTableName
. ' (' . implode(',', array_keys($this->mapping))
. ') '. ' VALUES ' . ' (:'
. implode(',:', array_keys($this->mapping)) . ')';
$stmt = $this->connection->pdo->prepare($sql);
if ($truncateDestTable) {
$sql = 'DELETE FROM ' . $destTableName;
$this->connection->pdo->query($sql);
}
}
} catch (PDOException $e) {
error_log(__METHOD__ . ':' . $e->getMessage());
throw new Exception(self::ERROR_COUNT);
}getRandomEntry(). If a database insert is requested, we execute the prepared statement in a try/catch block. In any event, we turn this method into a generator using the yield keyword:for ($x = 0; $x < $howMany; $x++) {
$entry = $this->getRandomEntry();
if ($insert) {
try {
$stmt->execute($entry);
} catch (PDOException $e) {
error_log(__METHOD__ . ':' . $e->getMessage());
throw new Exception(self::ERROR_DB);
}
}
yield $entry;
}
}The first thing to do is to ensure you have the data ready for random data generation. In this recipe, we will presume that the destination table is prospects, which has the following SQL database definition shown in step 7.
As a data source for names, you could create text files for first names and surnames. In this illustration, we will reference the data/files directory, and the files first_names.txt and surnames.txt. For city, state or province, postal code, and country, it might be useful to download the data from a source such as http://www.geonames.org/, and upload to a world_city_data table. For the remaining fields, such as address, e-mail, status, and so on, you could either use methods built into FakeData, or define callbacks.
Next, be sure to define Application\Test\FakeData, adding the content discussed in steps 8 to 29. After you have finished, create a calling program called chap_13_fake_data.php, which sets up autoloading and uses the appropriate classes. You should also define constants that match the path to the database configuration, and names files:
<?php
define('DB_CONFIG_FILE', __DIR__ . '/../config/db.config.php');
define('FIRST_NAME_FILE', __DIR__ . '/../data/files/first_names.txt');
define('LAST_NAME_FILE', __DIR__ . '/../data/files/surnames.txt');
require __DIR__ . '/../Application/Autoload/Loader.php';
Application\Autoload\Loader::init(__DIR__ . '/..');
use Application\Test\FakeData;
use Application\Database\Connection;Next, define a mapping array that uses the column names in the destination table (prospects) as a key. You need to then define sub-keys for source, name, and any other parameters that are required. For starters, 'first_name' and 'last_name' will both use a file as a source, 'name' points to the name of the file, and 'type' indicates a file type of text:
$mapping = [ 'first_name' => ['source' => FakeData::SOURCE_FILE, 'name' => FIRST_NAME_FILE, 'type' => FakeData::FILE_TYPE_TXT], 'last_name' => ['source' => FakeData::SOURCE_FILE, 'name' => LAST_NAME_FILE, 'type' => FakeData::FILE_TYPE_TXT],
The 'address', 'email', and 'last_updated' all use built-in methods as a data source. The last two also define parameters to be passed:
'address' => ['source' => FakeData::SOURCE_METHOD,
'name' => 'getAddress'],
'email' => ['source' => FakeData::SOURCE_METHOD,
'name' => 'getEmail',
'params' => ['first_name','last_name']],
'last_updated' => ['source' => FakeData::SOURCE_METHOD,
'name' => 'getDate',
'params' => [date('Y-m-d'), 365*5]]The 'phone', 'status' and 'budget' could all use callbacks to provide fake data:
'phone' => ['source' => FakeData::SOURCE_CALLBACK,
'name' => function () {
return sprintf('%3d-%3d-%4d', random_int(101,999),
random_int(101,999), random_int(0,9999)); }],
'status' => ['source' => FakeData::SOURCE_CALLBACK,
'name' => function () { $status = ['BEG','INT','ADV'];
return $status[rand(0,2)]; }],
'budget' => ['source' => FakeData::SOURCE_CALLBACK,
'name' => function() { return random_int(0, 99999)
+ (random_int(0, 99) * .01); }]And finally, 'city' draws its data from a lookup table, which also gives you data for the fields listed in the 'mapping' parameter. You can then leave those keys undefined. Notice that you should also specify the column representing the primary key for the table:
'city' => ['source' => FakeData::SOURCE_TABLE, 'name' => 'world_city_data', 'idCol' => 'id', 'mapping' => [ 'city' => 'city', 'state_province' => 'state_province', 'postal_code_prefix' => 'postal_code', 'iso2' => 'country'] ], 'state_province'=> [], 'postal_code' => [], 'country' => [], ];
You can then define the destination table, a Connection instance, and create the FakeData instance. A foreach() loop will suffice to display a given number of entries:
$destTableName = 'prospects';
$conn = new Connection(include DB_CONFIG_FILE);
$fake = new FakeData($conn, $mapping);
foreach ($fake->generateData(10) as $row) {
echo implode(':', $row) . PHP_EOL;
}The output, for 10 rows, would look something like this:

Here is a summary of websites with various lists of data that could be of use when generating test data:
|
Type of Data |
URL |
Notes |
|---|---|---|
|
Names | ||
|
http://www.babynamewizard.com/international-names-lists-popular-names-from-around-the-world | ||
|
Raw Name Lists | ||
|
US male first names | ||
|
US female first names | ||
|
US male first names | ||
|
Last Names |
US surnames from census | |
|
http://surname.sofeminine.co.uk/w/surnames/most-common-surnames-in-great-britain.html | ||
|
List of US surnames in the form of a PHP array | ||
|
Dutch surnames | ||
|
International surnames; just change the last letter(s) to get a list of names starting with that letter(s) | ||
|
Cities |
World cities | |
|
Postal Codes |
US only; includes cities, postal codes, latitude and longitude | |
|
International; city names, postal codes, EVERYTHING!; free download |