Divide-and-conquer and parallel processing in PHP
Submitted by johnlim on Sat, 20/09/2008 - 10:24am.
In my previous post Easy Parallel Processing in PHP, I showed you how to implement parallel batch processing using PHP and a web server. In this post, I want to discuss partitioning your tasks so that they become easily parallelized.
The strategy I prefer is divide-and-conquer. This works by recursively breaking down a problem into two or more sub-problems of the same type, until these become simple (and fast) enough to be solved directly. The solutions to the sub-problems are then combined to give a solution to the original problem.
To illustrate with an example, lets say you have millions of financial payment data records in a database you want to process in parallel using PHP:
- First group your data into logical chunks that need to be processed in one transaction. If you are processing payments for lots of accounts, then grouping them by account number makes lots of sense.
- Decide on how many parallel child processes you want to run. For this example, assume we are on a single dual core CPU server, so it makes sense to only run two concurrent child processes.
- Split all the records by the median account (the median is a statistical term that means the middle record in a range). To make it easy to split by the median
- From the parent process, pass one child process record 1 to median-1 and pass the second child process the median to final records as $_GET parameters.
- For simplicity's sake, both child processes will run the same code, but receive different parameters. The results of the processing can be either stored in the database, or returned back to the parent process by echo'ing the results in the child process.
To find the median of a set of records in a database, I have extended ADOdb, the popular PHP open source database abstraction library I maintain with the following function defined in the ADOConnection class:
function GetMedian($table, $field,$where = '') { $total = $this->GetOne("select count(*) from $table $where"); if (!$total) return false; $midrow = (integer) ($total/2); $rs = $this->SelectLimit("select $field from $table $where order by 1",1,$midrow); if ($rs && !$rs->EOF) return reset($rs->fields); return false; }
If you have a Quad-Core CPU then you can call GetMedian 3 times to break up the data into 4 approximately equal parts, and pass then to 4 child processes:
$mid = $db->GetMedian('PAYMENTS', 'ACCOUNTNO'); if (!$mid) return 'Error'; $lomid = $db->GetMedian('PAYMENTS', 'ACCOUNTNO', "where ACCOUNTNO < $mid"); $himid = $db->GetMedian('PAYMENTS', 'ACCOUNTNO', "where ACCOUNTNO >= $mid");
The above GetMedian function is not particularly optimal when you want need to run it multiple times on the same dataset. Improvements are left to the reader (or in a future blog entry).
PS: Another strategy for parallelization popularised by Google is Map Reduce.
No comments:
Post a Comment