This article was originally published on Dev.
Web Scraping in PHP using Goutte II
In the last article, we were introduced to web scraping and we looked at Goutte, a wonderful PHP web scraping library. In this article, we will put our knowledge into practice by scraping the website of the Punch. To be more specific, we will scrape the Punch news page (https://punchng.com/topics/news) to get the latest news headlines 😎.
Let’s get right into it 💪 !
NB: This is for testing purposes only. I do not in any way intend to reproduce the material obtained from the Punch, and I do not advise you to do so, as that would be copyright infringement.
First things first, we set up Composer autoloading, import the Goutte namespace and we instantiate a new Goutte Client:
require "vendor/autoload.php";
use Goutte\Client;
$client = new Client();
The next step is to send a request via the $client object. The request returns a Crawler instance, and it is this instance that we use to apply our filters.
$crawler = $client->request('GET',"https://punchng.com/topics/news");
On the front page of the Punch news page are article boxes. Each article has its own box and a heading (the headline) with the class “.seg-title”. We want to select all the headlines (.seg-title) on the page and then take each of them one by one. We do it with this:
$crawler->filter('.seg-title')->each(function ($node){
});
Notice the each() method? It allows us to iterate over the current selection (a node list) when it contains more than one node. As mentioned above, we are selecting all of the headlines (.seg-title), so we have more than one node and we want to iterate through them. Under the hood, the each() method accepts an anonymous function, loops through the current node list and passes a node to the closure on each iteration, thus allowing us to access the current node ($node) inside the closure.
public function each(\Closure $closure)
{
    $data = array();
    foreach ($this->nodes as $i => $node) {
        $data[] = $closure($this->createSubCrawler($node), $i);
    }

    return $data;
}
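To see each() in action without hitting the network, we can run it on a static HTML snippet through DomCrawler directly (the HTML below is a made-up stand-in for the Punch front page, not the real markup):

```php
<?php

require "vendor/autoload.php";

use Symfony\Component\DomCrawler\Crawler;

// A static HTML snippet standing in for the real page markup
$html = '<div>'
      . '<h2 class="seg-title">First headline</h2>'
      . '<h2 class="seg-title">Second headline</h2>'
      . '</div>';

$crawler = new Crawler($html);

// each() hands a sub-crawler for every matched node to the closure
// and returns an array of whatever the closure returns
$headlines = $crawler->filter('.seg-title')->each(function ($node) {
    return $node->text();
});

print_r($headlines);
```

Running this prints an array containing "First headline" and "Second headline", which is exactly the behaviour we rely on when scraping the live page.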
Alright, the next thing we want to do is extract the text from the current node.
$crawler->filter('.seg-title')->each(function ($node){
$headline = $node->text();
echo $headline;
});
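Since each() returns an array of the closure's return values (as we saw in its source above), we could also collect the headlines into a variable instead of echoing them inside the closure. A small variation on the script, continuing from the same $crawler:

```php
// Collect the headlines rather than printing them immediately
$headlines = $crawler->filter('.seg-title')->each(function ($node) {
    return trim($node->text());
});

// Print them as a numbered list
foreach ($headlines as $i => $headline) {
    echo ($i + 1) . ". " . $headline . PHP_EOL;
}
```

Having the headlines in an array makes it easy to do more with them later, such as writing them to a file or a database.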
We get the textual content of the node by calling the text() method. The next thing we do is print out the headline, and there we have it! We will always get the latest 10 news headlines on the Punch printed out to us whenever we run this script. Like I said in the previous article, when it comes to scraping, almost anything is possible (even logging in and filling forms). The limit is your mind 😊. I honestly wish we could go deeper, but sadly that's all for now 😅.
For more information, please read the docs for DomCrawler, CssSelector and Goutte.