This article was originally published on Dev.

Web Scraping in PHP using Goutte

Today I will be talking about something very common: web scraping. Depending on your needs or a client’s needs, situations may arise where you need to extract data from a webpage.

What is Web Scraping?

According to WebHarvy, web scraping (also termed screen scraping, web data extraction, web harvesting, etc.) is a technique employed to extract large amounts of data from websites. In its simplest form, web scraping is getting the contents of a webpage via a script.

Alright, let’s move on to web scraping in PHP. Recently, I needed to scrape a site for a client in PHP, so I looked for articles on the topic and found that there were few, and most of them were pretty outdated.

However, in my research, I came across Goutte, a (wonderful) screen scraping and web crawling library for PHP. At its core, Goutte is a wrapper around three of Symfony’s components (God bless Fabien 🙌): BrowserKit, CssSelector and DomCrawler. It is important to understand what each of these components does, as that helps us appreciate just how powerful Goutte is.

BrowserKit: Simply put, the BrowserKit component simulates the behaviour of a real browser. It is the foundational element of Goutte.
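To make that a little more concrete, here is a minimal sketch of what driving this simulated browser through Goutte might look like. It assumes Goutte was installed with Composer, so that vendor/autoload.php exists next to the script. The fetchPageTitle() helper and the idea of passing the URL on the command line are my own illustration, not part of Goutte.

```php
<?php
// Assumes `composer require fabpot/goutte` has been run in this directory.
require __DIR__ . '/vendor/autoload.php';

use Goutte\Client;

/**
 * Hypothetical helper: fetch a page and return the text of its <title> tag.
 */
function fetchPageTitle(Client $client, string $url): string
{
    // request() sends the HTTP request, much like typing a URL into a
    // browser, and returns a DomCrawler instance wrapping the response.
    $crawler = $client->request('GET', $url);

    $title = $crawler->filter('title');

    // Return an empty string when the page has no <title> element.
    return $title->count() > 0 ? $title->text() : '';
}

// Only hit the network when a URL is passed on the command line,
// e.g. `php title.php https://example.com/`.
if (PHP_SAPI === 'cli' && isset($argv[1])) {
    echo fetchPageTitle(new Client(), $argv[1]), PHP_EOL;
}
```

The $crawler returned by request() is the same kind of object used in the snippets that follow.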

DomCrawler: The DomCrawler component eases navigation of the DOM (Document Object Model). It allows us to navigate the DOM like this:

    $crawler = $crawler->filter('body > p');

We can also traverse the nodes of the DOM using some of the methods that it provides. For example, if we want to get the first paragraph in the body of the page, we could do this:

    $crawler->filter('body > p')->eq(0);

The eq() method is zero-indexed and takes a number specifying the position of the element we want to access. There are other methods such as siblings(), first() (an alias of eq(0); under the hood it just calls eq(0)), last(), etc.
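If you would like to try these traversal methods without touching a real website, the DomCrawler can also be fed a plain HTML string. The following is a small sketch under that assumption. The HTML snippet is made up for illustration, and it assumes symfony/dom-crawler and symfony/css-selector are installed with Composer (both come along with Goutte).

```php
<?php
// Assumes the Symfony components were installed with Composer.
require __DIR__ . '/vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

// Made-up HTML, used only to illustrate the traversal methods.
$html = <<<'HTML'
<!DOCTYPE html>
<html>
  <body>
    <p>First paragraph</p>
    <p>Second paragraph</p>
    <p>Third paragraph</p>
  </body>
</html>
HTML;

$crawler = new Crawler($html);
$paragraphs = $crawler->filter('body > p');

// eq() is zero-indexed, so eq(1) is the second paragraph.
$second = $paragraphs->eq(1)->text();

// first() and last() pick the two ends of the matched set.
$first = $paragraphs->first()->text();
$last = $paragraphs->last()->text();

// siblings() returns the other elements that share the same parent.
$siblingCount = $paragraphs->first()->siblings()->count();

echo $first, PHP_EOL;        // First paragraph
echo $second, PHP_EOL;       // Second paragraph
echo $last, PHP_EOL;         // Third paragraph
echo $siblingCount, PHP_EOL; // 2
```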

CssSelector: The CssSelector is a wonderful component that allows us to select elements via CSS selectors. It does this by converting the CSS selectors to their XPath equivalents. So, for example, say we wanted to select an element with a class called “fire”; we could do this:

    $crawler->filter('.fire');
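If you are curious about the XPath that such a selector turns into, the CssSelector component can be used on its own. This is just a quick sketch, assuming symfony/css-selector is installed with Composer. I am not reproducing the generated expression here since it is easier to run the script and look at it yourself.

```php
<?php
// Assumes `composer require symfony/css-selector` has been run.
require __DIR__ . '/vendor/autoload.php';

use Symfony\Component\CssSelector\CssSelectorConverter;

$converter = new CssSelectorConverter();

// Convert the CSS selector into its XPath equivalent.
$xpath = $converter->toXPath('.fire');

// Print the generated XPath expression so we can inspect it.
echo $xpath, PHP_EOL;
```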

The CssSelector component is so amazing that it even supports attribute selectors such as:

    $crawler->filter('div[style*="max-height:175px; overflow: hidden;"]');

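The above means that we are looking for a div element whose inline style attribute contains the text "max-height:175px; overflow: hidden;". The *= operator matches a substring, so the attribute may contain other declarations as well.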
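Here is a small sketch of how that attribute selector behaves on a made-up piece of HTML. The markup is my own invention for illustration, and the script assumes symfony/dom-crawler and symfony/css-selector are installed with Composer.

```php
<?php
// Assumes the Symfony components were installed with Composer.
require __DIR__ . '/vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

// Made-up HTML with three divs carrying different inline styles.
$html = <<<'HTML'
<html>
  <body>
    <div style="max-height:175px; overflow: hidden;">Exact style</div>
    <div style="color: red; max-height:175px; overflow: hidden; margin: 0;">Extra declarations</div>
    <div style="max-height:200px;">Different style</div>
  </body>
</html>
HTML;

$crawler = new Crawler($html);

// *= matches any div whose style attribute contains the given text.
$matches = $crawler->filter('div[style*="max-height:175px; overflow: hidden;"]');

// Collect the text of every matching div.
$texts = $matches->each(function (Crawler $node) {
    return $node->text();
});

echo $matches->count(), PHP_EOL; // 2
print_r($texts);
```

The first two divs match because both style attributes contain the searched text; the third one does not.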

For more information, please read the docs of DomCrawler, CssSelector and Goutte.

Alright, now that we have a bit of an idea about the three major components, it is time for us to bring everything together and actually scrape something. As you may have realised by now, when it comes to scraping, there is no fixed way to do it. You are free to explore and try out many ways to get your data. The only limit you have is your creativity. There are times (actually, a lot of times) when I have had to combine the CssSelector and DomCrawler in order to get what I want.
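As a small taste of what combining the two can look like, here is a sketch that uses CSS selectors to find each post on a page and DomCrawler methods to pull out its pieces. The HTML, the class names and the array keys are all invented for this example, and the script assumes symfony/dom-crawler and symfony/css-selector are installed with Composer.

```php
<?php
// Assumes the Symfony components were installed with Composer.
require __DIR__ . '/vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

// Made-up listing page: two posts and an unrelated sidebar link.
$html = <<<'HTML'
<html>
  <body>
    <div class="post"><h2><a href="/posts/1">First post</a></h2><p>Intro one</p></div>
    <div class="post"><h2><a href="/posts/2">Second post</a></h2><p>Intro two</p></div>
    <div class="sidebar"><a href="/about">About</a></div>
  </body>
</html>
HTML;

$crawler = new Crawler($html);

// Select every post with a CSS selector, then walk each node
// with DomCrawler methods to extract the parts we care about.
$posts = $crawler->filter('div.post')->each(function (Crawler $post) {
    $link = $post->filter('h2 > a')->first();

    return [
        'title'   => $link->text(),
        'url'     => $link->attr('href'),
        'summary' => $post->filter('p')->text(),
    ];
});

print_r($posts);
```

The sidebar link is skipped because only elements matching div.post are visited.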

In the next post, we are going to put everything that we have learnt so far into play by scraping the website of the Punch.