visit
“Goutte provides a nice API to crawl websites and extract data from the HTML/XML responses.”This also means you can login to a website, submit a form, upload a file and whatnot all using this library siting at your server (or even on your local system) and still do all the things you would normally do.To scrap any website you basically need to first visit the website and do recon.And how do you do that? Simple — Inspector tool. Most modern day web browsers have “inspect” tool when you right click a web page (For Safari, you need to enable the option “Show Develop menu in menu bar” under the Advanced tab of Preference). Once you select inspect tool, the browser’s developer window (or console window) will open up docked on one side of window. The first tab is the “Elements” tab. Inside this tab you get to see the elements and their styling.This is very important to see the “name” attribute of the elements, to see if there are any hidden elements, find the form or its action attribute etc., you can use this “inspect” tool.This is recon.Ok, now let’s see this in action. Suppose your objective is to find the list of repositories that a user has created on their GitHub account (you could do more, but let’s just play nice ;-) ).Let’s create a folder in your localhost directory by the name “crawler” (name is irrelevant).Then head over to the library’s repo page on GitHub to download from or git clone After unzipping (if you downloaded) or git cloning, you’ll see a directory named “Goutte”.You’ll be needing composer installed as well to make this library work. Download it from
Now execute this command to add dependency libraries — composer require fabpot/goutte
Give this command a minute or so finish. After successfully executing this command vendor directory (containing dependencies) and composer file is added.Now we are ready. Let’s create an interface for the user. Nothing fancy, simple HTML login. Create an index.php file with the following code.<form method="POST">
<div class=“form-group">
<label for=“git_email">Email address</label>
<input type="email" class="form-control" id=“git_email" name=“git_email" placeholder="Enter email">
</div>
<div class="form-group">
<label for=“git_pwd">Password</label>
<input type="password" class="form-control" id="git_pwd" name="git_pwd" placeholder="Password">
</div> <button type="submit" class="btn btn-primary">Submit</button>
</form>
if(isset($_POST["“git_email"]) && isset($_POST["git_pwd"]) && !empty($_POST["“git_email"]) && !empty($_POST["git_pwd"])){
//check if the email and password params are there
}
require_once("vendor/autoload.php");
require_once(“Goutte/Goutte/Client.php");
use Goutte\Client;
$client = new Client();
//initialise the crawler with request type get and the login url
$crawler = $client->request('GET', '//github.com/login');
//find the form
$form = $crawler->selectButton('Sign in')->form();
//set the post params
$form->setValues(['login' => $_POST["git_email"], 'password' => $_POST[“git_pwd"]]);
// submit the form
$crawler = $client->submit($form);
//variable to store the logged in username
$username = “";
//Find all the meta tag
$crawler->filter('meta')->each(function ($node) {
global $username;//This is important, since you need to set the variable
//Stop at the meta tag that has “octolytics-actor-login" as it’s name
if(trim($node->attr("name")) == "octolytics-actor-login"){
//set the variable value and return
$username = ($node->attr("content"));
return;
}
});
if($username == “”){
//give error message
echo “<center>The credentials entered are invalid</center>”;
}
$crawler = $client->request('GET', '//github.com/'.$username.'?tab=repositories');
//loop through all “a” tag contained inside list element with “source” class
$crawler->filter('li.source a')->each(function ($node) {
//This additional check is to determine we only get repo name
if(is_numeric($node->text()) === false){
echo $node->text();
echo "<br/>";
}
});
Previously published at