This article was originally published on Dev.

Web Scraping in PHP using Goutte II

In the last article, we were introduced to web scraping and we looked at Goutte, a wonderful PHP web scraping library. In this article, we will put our knowledge into practice by scraping the website of The Punch. To be more specific, we will scrape The Punch's news page ( https://punchng.com/topics/news ) to get the latest news headlines 😎.

Let’s get right into it 💪!

NB: This is for testing purposes only. I do not in any way intend to reproduce the material obtained from The Punch, and I do not advise you to do so, as that would be copyright infringement.

First things first, we set up Composer autoloading, import the Goutte namespace, and instantiate a new Goutte Client:

    require "vendor/autoload.php";
    use Goutte\Client;
    $client = new Client();
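
In case the project isn't set up yet, Goutte can be pulled in with Composer, which also generates the vendor/autoload.php file we require above. Assuming the standard fabpot/goutte package from Packagist, the install command would be:

     composer require fabpot/goutte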

The next step is to send a request via the $client object. The request() method returns a Crawler instance, and it is this instance that we use to apply our filters.

     $crawler = $client->request('GET',"https://punchng.com/topics/news");
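
At this point we have a Crawler loaded with the page's HTML. As a quick sanity check (not part of the original walkthrough, just a handy habit), we could print the page's title to confirm the request actually worked:

     // Quick sanity check: print the <title> of the page we just fetched
     echo $crawler->filter('title')->text() . PHP_EOL;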

On The Punch's news page, every article gets its own box, and each box has a heading (the headline) with the class “.seg-title”. We want to select all the headlines (.seg-title) on the page and then take each of them one by one. We do it with this:

     $crawler->filter('.seg-title')->each(function ($node) {

     });

Notice the each() method? The each() method allows us to iterate over the current selection (the node list) when it contains more than one node. As we mentioned above, we are selecting all of the headlines (.seg-title), so we have more than one node and we want to iterate through them. Under the hood, each() accepts an anonymous function (a closure), loops through the current node list, wraps each node in its own little crawler, and passes it to the closure on each iteration, which is how we get access to the current node ($node) inside the closure. This is roughly what each() looks like in the DomCrawler source:

     public function each(\Closure $closure)
     {
          $data = array();
          foreach ($this->nodes as $i => $node) {
              $data[] = $closure($this->createSubCrawler($node), $i);
          }

          return $data;
      }
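
Notice that each() gathers whatever the closure returns into $data and hands that array back to us. In this article we'll simply echo from inside the closure, but that return value means we could just as easily collect the headlines into a plain PHP array for later use (text() returns the node's text, which we'll get to next). A small variation, for example:

     // each() returns an array of the closure's return values,
     // so $headlines ends up as a plain array of headline strings
     $headlines = $crawler->filter('.seg-title')->each(function ($node) {
         return $node->text();
     });

     print_r($headlines);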

Alright, the next thing we want to do is extract the text from the current node.

     $crawler->filter('.seg-title')->each(function ($node) {
         // text() returns the text content of the current node
         $headline = $node->text();
         echo $headline . PHP_EOL;
     });

We get the textual content of the node by calling the text() method. The next thing we do is print out the headline, and there we have it! Whenever we run this script, we get the latest 10 news headlines on The Punch printed out to us. Like I said in the previous article, when it comes to scraping, almost anything is possible, even logging in and filling out forms (there's a small sketch of that below). The limit is your mind 😊. I honestly wish we could go deeper, but sadly that's all for now 😅.
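
To give a tiny taste of the form-filling I mentioned, here is a minimal, untested sketch of what logging in with Goutte could look like. The URL, button label, and field names below are made up purely for illustration; you would swap in the ones from the page you actually want to automate:

     // Hypothetical example: the URL, button text and field names are placeholders
     $crawler = $client->request('GET', 'https://example.com/login');

     // Select the form via its submit button, then fill and submit it
     $form = $crawler->selectButton('Log in')->form();
     $crawler = $client->submit($form, [
         'email'    => 'you@example.com',
         'password' => 'secret',
     ]);

     // $crawler now holds the page we landed on after submitting the form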

For more information, please do read the docs for DomCrawler, CssSelector, and Goutte.