Parsing HTML pages using XPath

In my previous job I spent a lot of time programming automatic parsers for sports results from various websites. I found it quite hard to find a useful tutorial on parsing HTML pages, so I decided to write one. In this short tutorial I'm going to write a parser that crawls pages with English soccer league results using XPath and DOM traversal.

You can download the complete source code for this tutorial or just check what the output looks like.

BTW, if you're looking for some more complicated examples of parsing in PHP using XPath, take a look at this PHP Documentation Parser. It's a parser that I used for PHP Ninja Manual.

You can see full source code on gist.github.com.

So the first thing we have to do is download the page we want to parse. I chose livescore.com as my source, but the basic principle is the same for any website. Just to be precise, we'll fetch http://www.livescore.com/soccer/england/ and crawl all the soccer results there.

$curl = curl_init('http://www.livescore.com/soccer/england/');
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($curl, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.10 (KHTML, like Gecko) Chrome/8.0.552.224 Safari/534.10');
$html = curl_exec($curl);
curl_close($curl);
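
A small aside: curl_exec() returns false when the download fails, so it can be worth checking the result before handing it to the parser. Here's a minimal sketch of how that might look. The fetchHtml() name, its timeout parameter and the "status 400 or above is a failure" rule are my own assumptions for illustration, not part of the original script.

```php
<?php
/**
 * Hypothetical helper: downloads a page and throws instead of returning false.
 * The name and the timeout parameter are illustrative assumptions.
 */
function fetchHtml($url, $timeoutSeconds = 10)
{
    $curl = curl_init($url);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($curl, CURLOPT_CONNECTTIMEOUT, $timeoutSeconds);
    curl_setopt($curl, CURLOPT_TIMEOUT, $timeoutSeconds);

    $html = curl_exec($curl);
    if ($html === false) {
        // curl_error() describes transport failures (DNS, refused connection, timeout)
        throw new RuntimeException('Download failed: ' . curl_error($curl));
    }

    // CURLINFO_HTTP_CODE is the status of the final response, after redirects
    $status = curl_getinfo($curl, CURLINFO_HTTP_CODE);
    if ($status >= 400) {
        throw new RuntimeException('Unexpected HTTP status: ' . $status);
    }

    return $html;
}
```

With this in place the download step becomes $html = fetchHtml('http://www.livescore.com/soccer/england/'); wrapped in a try/catch. You can still set CURLOPT_USERAGENT inside the helper exactly as in the code above.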

Once we have the HTML content, we can load it into a DOMDocument object and then create a DOMXPath object for traversing it.

$dom = new DOMDocument();
@$dom->loadHTML($html);

$xpath = new DOMXPath($dom);

I put @ before $dom->loadHTML($html) because loadHTML usually raises a lot of warnings and notices that are not important for us. Even though the HTML page is not valid, the DOMDocument object is able to construct the DOM anyway.
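
If you'd rather not silence everything with @, libxml can collect its parse errors internally instead of turning them into PHP warnings. The following is just a sketch of that alternative; the sample markup is made up, and I'm not printing a fixed error count because it depends on the libxml version.

```php
<?php
// Made-up sample markup: a valid table followed by an unknown tag and a stray </div>.
$html = '<table><tr><td>FT</td><td>Fulham</td><td>3 - 0</td>'
      . '<td>West Bromwich A.</td></tr></table><foo>bar</div>';

// Collect libxml's parse errors internally instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($html);

// The errors stay available in case we want to log them.
$errors = libxml_get_errors();
libxml_clear_errors();
libxml_use_internal_errors($previous);

$xpath = new DOMXPath($dom);
$cells = $xpath->query('//td');
printf("%d parse errors collected, %d cells found\n", count($errors), $cells->length);
```

The DOM is built the same way as with @; the only difference is that the problems libxml noticed can be inspected afterwards.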

In the next step we have to find a container that we can traverse (usually a div or a table) and pick the scores out of it. In my case I used Google Chrome's developer tools:

The results I'm looking for start on the highlighted row. This table contains rows with all the scores I want to crawl, so it is the first container we'll fetch using XPath.

// this looks crazy, but basically I just rewrote the path shown in the
// bottom line of Google Chrome's developer tools
$tableRows = $xpath->query('//table[1]//tr[4]//table//tr[1]/td[5]//table//tr');
// array for crawled results
$scores = array();

Now we have all the rows with scores, so we can go through them one by one, check whether a row contains a score, a date, a league title, etc., and write a condition for each case.

Have another look at the page's DOM and see what we can expect.

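For us there are only three different types of rows: a league heading (a single td containing a b element), a date (two td cells) and a match result (four td cells).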

So let's begin with the main loop and check what's in each row:

foreach ($tableRows as $row) {
    // fetch all 'td' inside this 'tr'
    $td = $xpath->query('td', $row);
    // we'll store information about each match in this array
    $match = array();

    if ($td->length == 1 && $xpath->query('td/b', $row)->length == 1) {
        /* ... */
    } elseif ($td->length == 2) { // date
        /* ... */
    } elseif ($td->length == 4) { // match result
        /* ... */
    }
}
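
To see the row classification in isolation, here is a small self-contained sketch that runs the same conditions against a hand-written table. The sample markup is hypothetical; it only mimics the three row shapes and is not copied from livescore.com.

```php
<?php
// Hypothetical sample markup with one row of each type.
$html = <<<HTML
<table>
  <tr><td><b>England</b> - Premier League</td></tr>
  <tr><td></td><td>January 4</td></tr>
  <tr><td>FT</td><td>Fulham</td><td>3 - 0</td><td>West Bromwich A.</td></tr>
</table>
HTML;

libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$types = array();
foreach ($xpath->query('//table//tr') as $row) {
    // a relative expression ('td', not '//td') is evaluated against $row only
    $td = $xpath->query('td', $row);

    if ($td->length == 1 && $xpath->query('td/b', $row)->length == 1) {
        $types[] = 'league';
    } elseif ($td->length == 2) {
        $types[] = 'date';
    } elseif ($td->length == 4) {
        $types[] = 'match';
    } else {
        $types[] = 'other';
    }
}

print_r($types); // league, date, match
```

The important detail is the second argument of query(): passing $row as the context node keeps the 'td' lookup inside that single row.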

The last and probably the most complicated step is to process each row and decide what to do with it.

// check league heading
if ($td->length == 1 && $xpath->query('td/b', $row)->length == 1) {
    // cut the country name and leave just the league
    $league = substr($xpath->query('td/text()', $row)->item(1)->textContent, 3);
    $scores[$league] = array();
} elseif ($td->length == 2) { // date
    $month = date('m', strtotime(substr($td->item(1)->textContent, 0,
                  strpos($td->item(1)->textContent, ' '))));
    $day = sprintf('%02s', preg_replace('/[^0-9]/i', '',
                   substr($td->item(1)->textContent,
                          strpos($td->item(1)->textContent, ' ') + 1)));
    $thisMonth = date('m');
    $thisYear = date('Y');
    // the page shows no year, so we have to infer it: a month more than six
    // months ahead of today belongs to last year, a month more than six
    // months behind belongs to next year, anything else is this year
    if ($month - $thisMonth > 6) {
        $date = ($thisYear - 1) . '-' . $month . '-' . $day;
    } elseif ($thisMonth - $month > 6) {
        $date = ($thisYear + 1) . '-' . $month . '-' . $day;
    } else {
        $date = $thisYear . '-' . $month . '-' . $day;
    }
} elseif ($td->length == 4) { // check match result
    /**
     *  first column contains match status:
     *    FT     - match finished
     *    Pen.   - match finished after penalties
     *    Postp. - match postponed to another day
     *    hh:mm  - upcoming match
     *    mm'    - pending match
     */
    $status = preg_replace('/[^a-zA-Z0-9\'\.:]*/i', '',
                           $td->item(0)->textContent);
    if ($status == 'FT') {
        $match['status'] = 'finished';
    } elseif ($status == 'Pen.') {
        $match['status'] = 'penalties';
    } elseif ($status == 'Postp.') {
        $match['status'] = 'postponed';
    } elseif (preg_match('/[0-9]{2}:[0-9]{2}/', $status)) {
        $match['status'] = 'upcoming';
        $match['begin'] = $status;
    } elseif (strpos($status, "'") !== false) {
        $match['status'] = 'pending';
        $match['time'] = trim($status, "'");
    } else {
        $match['status'] = 'unknown';
    }
    
    $match['team1'] = $td->item(1)->textContent; // first team's name
    list($score1, $score2) = explode('-', $td->item(2)->textContent);
    $match['team2'] = $td->item(3)->textContent; // second team's name
    $match['team1score'] = trim($score1); // first team's score
    $match['team2score'] = trim($score2); // second team's score
    $match['date'] = $date; // date of the match
    
    // add match to list of all matches for this league
    $scores[$league][] = $match;
}
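
The date branch is the hardest part to read, so here is a sketch of how it could be pulled out into a function and tried with a fixed "today". Everything here is an assumption on my side: the inferMatchDate() name, the "month name, space, day" shape of the cell text, and the six-month threshold used to guess the year. date_parse() is used because it reads the month name without mixing in the current date.

```php
<?php
/**
 * Hypothetical helper: turns a cell text such as "January 4" into "Y-m-d".
 * $today is passed in so the year guess can be checked with a fixed date.
 * Returns null when the text does not look like "<month name> <day>".
 */
function inferMatchDate($text, DateTimeImmutable $today)
{
    $text = trim($text);
    $space = strpos($text, ' ');
    if ($space === false) {
        return null;
    }

    // date_parse() understands English month names and fills in 'month'
    $parsed = date_parse(substr($text, 0, $space));
    $month = $parsed['month'];
    // keep only the digits of the day part ("4th" becomes 4)
    $day = (int) preg_replace('/[^0-9]/', '', substr($text, $space + 1));
    if ($month === false || $day < 1 || $day > 31) {
        return null;
    }

    $thisMonth = (int) $today->format('n');
    $year = (int) $today->format('Y');
    if ($month - $thisMonth > 6) {
        $year--; // e.g. a December result seen in January
    } elseif ($thisMonth - $month > 6) {
        $year++; // e.g. a January fixture seen in December
    }

    return sprintf('%04d-%02d-%02d', $year, $month, $day);
}

$today = new DateTimeImmutable('2011-01-04');
echo inferMatchDate('January 4', $today), "\n";   // 2011-01-04
echo inferMatchDate('December 28', $today), "\n"; // 2010-12-28
```

Passing the reference date in as a parameter is only a convenience for testing; in the crawler you would call it with new DateTimeImmutable().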

Output

When processing is finished, the $scores variable should look like this:

Array
(
    [Premier League] => Array
        (
            [0] => Array
                (
                    [status] => finished
                    [team1] => Blackpool
                    [team2] => Birmingham C.
                    [team1score] => 1
                    [team2score] => 2
                    [date] => 2011-01-04
                )
            [1] => Array
                (
                    [status] => finished
                    [team1] => Fulham
                    [team2] => West Bromwich A.
                    [team1score] => 3
                    [team2score] => 0
                    [date] => 2011-01-04
                )
        )
    [League Championship] => Array
        (
            [0] => Array
                (
                    [status] => finished
                    [team1] => Cardiff C.
                    [team2] => Leeds U.
                    [team1score] => 2
                    [team2score] => 1
                    [date] => 2011-01-04
                )
        )
    ...
)
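
Once the array has this shape, doing something useful with it is just two nested loops. As a quick sketch, this prints one line per match; the sample data is a shortened copy of the output above and the line format is my own choice.

```php
<?php
// Shortened sample in the same shape as the output above.
$scores = array(
    'Premier League' => array(
        array('status' => 'finished', 'team1' => 'Blackpool', 'team2' => 'Birmingham C.',
              'team1score' => '1', 'team2score' => '2', 'date' => '2011-01-04'),
    ),
    'League Championship' => array(
        array('status' => 'finished', 'team1' => 'Cardiff C.', 'team2' => 'Leeds U.',
              'team1score' => '2', 'team2score' => '1', 'date' => '2011-01-04'),
    ),
);

$lines = array();
foreach ($scores as $league => $matches) {
    $lines[] = $league;
    foreach ($matches as $match) {
        // date, home team, score, away team
        $lines[] = sprintf('  %s  %s %s - %s %s',
            $match['date'], $match['team1'],
            $match['team1score'], $match['team2score'], $match['team2']);
    }
}

echo implode("\n", $lines), "\n";
```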

Download

Here you can download complete source code.

