w3programmers - Data scraping with PHP cURL

Data scraping with PHP cURL

cURL is a library for communicating with remote servers. It supports a large number of protocols, which makes it one of the most popular libraries of its kind. cURL can be used to access remote documents, upload files, submit forms, and much more. In this tutorial we will learn how to fetch HTML documents and submit a form.

Start off by writing a crawler class and adding a few properties to hold some default values.

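    class CURL_CRAWLER {
        public $url;          // address of the remote document
        public $request_type; // 'GET' or 'POST'
        public $data;         // raw response returned by cURL
        public $post_params;  // form fields for POST requests
        // The methods shown in the rest of this tutorial go here.
    }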

Now, write a constructor to assign the values on object initialization.

    public function __construct($url = '', $request_type = 'GET') {
        $this->url = $url;
        $this->request_type = $request_type;
        $this->data = '';
        $this->post_params = array();
    }

The main crawl() method fetches the document with cURL and assigns the response to the $data property.

    /** Crawl a document **/
    public function crawl() {
        $curl = curl_init( $this->url );
        curl_setopt($curl, CURLOPT_HEADER, false);         // don't include response headers in the output
        curl_setopt($curl, CURLOPT_TIMEOUT, 60);           // give up after 60 seconds
        curl_setopt($curl, CURLOPT_USERAGENT, 'cURL PHP');
        curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);  // return the response instead of echoing it
        $this->data = curl_exec($curl);                    // false on failure
        return $this; // make it a chainable method
    }
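One thing crawl() does not do is check whether the request succeeded: curl_exec() returns false on failure, and curl_error() describes what went wrong. The following is only a sketch of how that check could look. The function name fetch_url(), the shape of its return array, and the test URL are assumptions made for illustration and are not part of the tutorial's class.

```php
<?php
// Hypothetical helper (not part of CURL_CRAWLER): fetch a URL and report failures.
function fetch_url($url, $timeout = 5) {
    $curl = curl_init($url);
    curl_setopt_array($curl, array(
        CURLOPT_RETURNTRANSFER => true,     // return the body instead of echoing it
        CURLOPT_HEADER         => false,
        CURLOPT_CONNECTTIMEOUT => $timeout, // seconds allowed for connecting
        CURLOPT_TIMEOUT        => $timeout, // seconds allowed for the whole request
        CURLOPT_USERAGENT      => 'cURL PHP',
    ));

    $body = curl_exec($curl); // string on success, false on failure

    return array(
        'ok'     => $body !== false,
        'body'   => $body === false ? '' : $body,
        'error'  => $body === false ? curl_error($curl) : '',
        'status' => (int) curl_getinfo($curl, CURLINFO_HTTP_CODE),
    );
}

// The .invalid TLD is reserved and should never resolve, so this normally fails.
$r = fetch_url('http://invalid.invalid/');
if ($r['ok']) {
    echo 'Fetched ' . strlen($r['body']) . " bytes\n";
} else {
    echo 'Request failed: ' . $r['error'] . "\n";
}
```

The same check could be folded into crawl() itself, for example by leaving $data empty when curl_exec() returns false.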

Notice that crawl() returns $this, which makes the method chainable. If you don’t know about method chaining, don’t worry; we will see it in action shortly. Now that we have the crawler, we also need a parser to turn the captured data into a usable format. Let’s write it.

    /** Parse result data **/
    public function parse() {
        $result = array();
        $count = 0;
        $dom = new DOMDocument;
        $dom->preserveWhiteSpace = false;
        $dom->loadHTML($this->data); // load the fetched HTML so XPath has something to query
        $xpath = new DOMXPath($dom);
        $news = $xpath->query('//td[@bgcolor="#DDDDDD"]/table/tr[position()=2]/td[position()=2]');
        foreach ($news as $n) {
            $result[] = $n->nodeValue;
            $count++;
            if ($count > 9) {
                break; // we just need 10 results
            }
        }
        return $result;
    }

We have used the PHP DOM extension to parse the HTML output. Why not regular expressions? Well, an HTML document is complex, and DOM gives much better control over it. The XPath query here is written specifically for the news table on the DSE website. You may need to adjust it based on the structure of your document.
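To see how DOM and XPath fit together without depending on a live site, here is a small self-contained sketch. The HTML snippet and the variable names are made up for illustration; only DOMDocument, DOMXPath and query() are taken from the tutorial. The predicate [2] is XPath shorthand for [position()=2].

```php
<?php
// Made-up HTML: two tables, each with the interesting text in row 2, cell 2.
$html = <<<HTML
<html><body>
<table>
  <tr><td>Header</td><td>Ignored</td></tr>
  <tr><td>Label</td><td>First news item</td></tr>
</table>
<table>
  <tr><td>Header</td><td>Ignored</td></tr>
  <tr><td>Label</td><td>Second news item</td></tr>
</table>
</body></html>
HTML;

$dom = new DOMDocument;
$dom->loadHTML($html);

$xpath = new DOMXPath($dom);
// Second cell of every row that is the second row of its parent, in any table.
$cells = $xpath->query('//table//tr[2]/td[2]');

$items = array();
foreach ($cells as $cell) {
    $items[] = trim($cell->nodeValue);
}

print_r($items);
```

Changing the query string is usually all it takes to target a different part of the page, which is why the tutorial's query has to be adapted to each site.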

Now it’s time to see the result of our hard work! Instantiate an object and chain the two methods. Before you run the code, set error reporting to 0 with error_reporting(0): the DSE site contains invalid markup, so the script may otherwise throw lots of warning messages.
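Turning off error reporting hides every problem, not just the markup warnings. As a narrower alternative, libxml can be told to collect parse errors internally instead of emitting PHP warnings. The sketch below assumes a deliberately malformed, made-up HTML string; it is one possible approach and not the one the tutorial uses.

```php
<?php
// Made-up, deliberately broken markup.
$html = '<div><p>Unclosed paragraph<span>bad nesting</div></p>';

// Collect libxml errors internally instead of raising PHP warnings.
$previous = libxml_use_internal_errors(true);

$dom = new DOMDocument;
$dom->loadHTML($html);

$errors = libxml_get_errors();         // array of LibXMLError objects (may be empty)
libxml_clear_errors();                 // free the collected errors
libxml_use_internal_errors($previous); // restore the previous setting

echo count($errors) . " parse problem(s) collected\n";
foreach ($errors as $error) {
    echo 'line ' . $error->line . ': ' . trim($error->message) . "\n";
}
```

The document is still parsed as well as libxml can manage, so XPath queries work on it afterwards.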

    $dse = new CURL_CRAWLER('http://www.dsebd.org/display_news.php');
    echo "<pre>";
    print_r( $dse->crawl()->parse() );
    echo "</pre>";

The output is an array similar to the one shown in the screenshot below.

[Screenshot: output array]

Why stop there? You can even submit a form with cURL. Write a PHP script that accepts four POST parameters: name, email, subject, and body. When the form is submitted via cURL, the script sends an email!

    <?php
    if ( !empty($_POST['name'])    &&
         !empty($_POST['email'])   &&
         !empty($_POST['subject']) &&
         !empty($_POST['body']) ) {
        // good. now send email
        $body  = "Hi, " . htmlentities( $_POST['name'] ) . PHP_EOL . PHP_EOL; // PHP_EOL is the platform's line break
        $body .= $_POST['body'] . PHP_EOL;
        mail( $_POST['email'], $_POST['subject'], $body ); // send the composed $body, not the raw field
        echo 'Mail sent!';
    } else {
        echo "Oops! Something missing.. " . PHP_EOL;
    }

Save it as form.php in a directory under your local web server.

To submit forms we need to do a POST request with cURL and pass the post parameters as an associative array.

    /** Submit a form with cURL **/
    public function submit_form($post_params) {
        $this->post_params = $post_params;
        $this->request_type = 'POST'; // we know it must be POST for a form submit

        $curl = curl_init( $this->url );
        $curl_options = array(
            CURLOPT_RETURNTRANSFER => true,  // don't echo the result, return it
            CURLOPT_USERAGENT      => 'cURL PHP',
            CURLOPT_TIMEOUT        => 60,
            CURLOPT_HEADER         => false, // don't include the header in the output
            CURLOPT_POST           => true,  // we are doing a POST request
            CURLOPT_POSTFIELDS     => $this->post_params
        );
        curl_setopt_array($curl, $curl_options);
        $this->data = curl_exec($curl);
        return $this; // method chaining: $this->chain()->chain()
    }

A couple of changes here. Instead of calling curl_setopt() several times, we put all the options in an array and passed it to curl_setopt_array(). Clearly, programmers are lazy people; they love shortcuts! CURLOPT_POSTFIELDS receives the array containing all the POST parameters (the form data).
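One detail worth knowing: when CURLOPT_POSTFIELDS is given an array, cURL sends the body as multipart/form-data, whereas a URL-encoded string is sent as application/x-www-form-urlencoded, which is what a plain HTML form submits by default. The sketch below shows the string form using http_build_query(). The helper name post_urlencoded() and the sample fields are assumptions for illustration; the helper is defined but not called here, so nothing is sent over the network.

```php
<?php
// Sample form fields (made up for illustration).
$fields = array(
    'name'    => 'Somebody',
    'subject' => 'Hey, I am a PHP script!',
);

// Encode the fields the way a regular HTML form would.
$encoded = http_build_query($fields, '', '&');
echo $encoded, PHP_EOL; // name=Somebody&subject=Hey%2C+I+am+a+PHP+script%21

// Hypothetical helper: POST the fields as application/x-www-form-urlencoded.
function post_urlencoded($url, array $fields) {
    $curl = curl_init($url);
    curl_setopt_array($curl, array(
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_POST           => true,
        CURLOPT_POSTFIELDS     => http_build_query($fields, '', '&'), // string, not array
    ));
    return curl_exec($curl); // response body, or false on failure
}
```

Either form works with the form.php script in this tutorial, since PHP fills $_POST for both content types.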

Time to see the result! Create an array containing all post fields.

    $form_fields = array(
        'name'    => 'Somebody',
        'email'   => 'sombody@somebody.com',
        'subject' => 'Hey, I am a PHP script!',
        'body'    => 'Just to let you know that this mail is sent by your php script.'
    );

Now, same as before, instantiate an object, pass the url and the array. See the fruits of your hard work!

$form = new CURL_CRAWLER('http://localhost.net/form.php');
echo $form->submit_form($form_fields)->data;

The output of your script should be “Mail sent!”. We have learnt a couple of important things besides cURL: using PHP DOM to parse HTML documents, using XPath queries to search and select DOM nodes, and finally method chaining.
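Since method chaining was only used in passing, here is a minimal sketch of the idea on its own. The class TEXT_PIPELINE and its methods are hypothetical and exist only for this illustration; the point is that every method that returns $this can be followed directly by another call on the same object.

```php
<?php
// Hypothetical class, used only to illustrate chaining.
class TEXT_PIPELINE {
    private $text;

    public function __construct($text) {
        $this->text = $text;
    }

    public function strip() {
        $this->text = trim($this->text);
        return $this; // returning the object itself allows the next call
    }

    public function upper() {
        $this->text = strtoupper($this->text);
        return $this;
    }

    public function get() {
        return $this->text; // ends the chain: returns a string, not $this
    }
}

$pipeline = new TEXT_PIPELINE('  hello curl  ');
$output = $pipeline->strip()->upper()->get();
echo $output, PHP_EOL; // HELLO CURL
```

This is the same pattern as $dse->crawl()->parse(): crawl() returns the object, and parse() ends the chain by returning an array.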

