There is a wide range of data on the Internet. This data can be extracted and used. To do this, it's important to follow certain steps and know how to go about it. On the SeLoger site, you'll find a wide range of property listings. The data available on the site can be extremely useful. How do you analyse the SeLoger site correctly in order to retrieve data? Find out in this article.
What are the 3 different types of site used for web scraping?
When you want to learn how to retrieve data from websites like SeLoger.com, the first instinct is often to focus on the tools you need to use. There is a wide variety of them. Some are paid for, while others are free. However, you shouldn't rush into anything. Tools should not be your first concern in web scraping.
Many people try to understand how to retrieve data before they even understand how it is generated on the website. That is doing things in the wrong order. Scraping websites like SeLoger.com requires the right methodology, and that methodology starts with an analysis of the site before any scraping takes place.
Whenever you come across a website with data that might be of interest to you, using the right method will enable you to identify how the data is generated. Then you can easily retrieve it.
First of all, it's important to know that when it comes to web scraping, there are 3 distinct types of website. These are sites with an API, sites that have no API and load their data on the server side (back end), and sites that have no API and load their data on the browser side (front end). The ideal method is to determine which category the site you want to retrieve data from belongs to. In this case, that site is SeLoger.
About API data
To begin with, you need to ask yourself whether the SeLoger site has an API. To find out, use your browser's developer tools. You can use Mozilla Firefox for this, but Google Chrome works just as well. Open the SeLoger page you are interested in, then press F12 to display the developer console.
In the console, go to the "Network" tab. This is a menu that gives you information about all the resources loaded by the site through the browser. By default, all these resources are displayed indiscriminately by the console.
The next step is to filter by pressing the "XHR" filter button. If nothing is displayed, refresh the page. There's no need to go into detail about the XHR object. However, you should be aware that this is where you will find the calls made to an API, if there are any. This is where you need to make an extra effort.
This part of the analysis requires you to rummage around without any particular landmark. To begin with, click on the first item in the list. A section opens on the right with information about the call you have selected. In that section, open the "Response" tab.
After that, the data returned by the selected call will be displayed. You should therefore scroll through the different calls on the left of the console. Each time, you should also look in the right-hand part of the console (under the 'Response' tab) to see if there is any information relating to the data you are trying to scrape.
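Clicking through every call by hand can be tedious. Firefox and Chrome can both export the contents of the "Network" tab as a HAR file, which is plain JSON. The sketch below, a minimal illustration rather than a definitive tool, searches such an export for calls whose response contains a given piece of text. The function names, the sample data and the example.com URLs are made up for illustration.

```python
import base64
import json


def load_har(path: str) -> dict:
    """Load a HAR file exported from the browser's Network tab."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def find_calls_containing(har: dict, needle: str) -> list[str]:
    """Return the URLs of recorded calls whose response body contains `needle`."""
    matches = []
    for entry in har.get("log", {}).get("entries", []):
        content = entry.get("response", {}).get("content", {})
        body = content.get("text") or ""
        # Some bodies are stored base64-encoded in the HAR file.
        if content.get("encoding") == "base64":
            body = base64.b64decode(body).decode("utf-8", errors="replace")
        if needle.lower() in body.lower():
            matches.append(entry["request"]["url"])
    return matches


# Hand-made HAR fragment (not real SeLoger data), for demonstration only.
sample_har = {
    "log": {
        "entries": [
            {
                "request": {"url": "https://example.com/static/app.js"},
                "response": {"content": {"text": "console.log('hello');"}},
            },
            {
                "request": {"url": "https://example.com/api/listings?page=1"},
                "response": {"content": {"text": '{"city": "Paris", "zip": "75011"}'}},
            },
        ]
    }
}

print(find_calls_containing(sample_har, "75011"))
```

With a real export, you would call `load_har` on the saved file and search for a value you can see on the page, such as a price or a street name.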
As you can see, this methodology requires some improvisation. Web scraping is not an exact science. The key is to find what you're looking for. Among the calls listed, look for one whose response contains a list of information about the housing ads displayed on the page. Once you have identified this call, click on the "Headers" tab in the right-hand column. The very first piece of information shown there is the URL of the request.
This gives you the URL of the API that the site calls to load the data you see displayed on the page. You can stop the diagnosis at this point: the site you are on belongs to the first of the 3 categories listed above.
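Once you have an API URL, a first useful step is to look at the shape of the JSON it returns. The sketch below is one possible way to do that in Python, using only the standard library. The helper names and the sample payload are assumptions made for illustration, and the real response of any given site may look quite different or require extra headers.

```python
import json
from urllib.request import Request, urlopen


def fetch_json(url: str, timeout: float = 10.0):
    """Download a URL and parse the body as JSON."""
    request = Request(url, headers={"Accept": "application/json"})
    with urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def describe(payload, depth: int = 0, max_depth: int = 2) -> list[str]:
    """List the keys of a JSON payload with their types, down to `max_depth`."""
    lines = []
    if isinstance(payload, dict):
        for key, value in payload.items():
            lines.append("  " * depth + f"{key}: {type(value).__name__}")
            if depth < max_depth:
                lines.extend(describe(value, depth + 1, max_depth))
    elif isinstance(payload, list) and payload:
        # Only inspect the first element; list items usually share a shape.
        lines.extend(describe(payload[0], depth, max_depth))
    return lines


# Made-up payload standing in for an API response.
sample_payload = {"items": [{"price": 1200, "city": "Paris"}], "total": 1}

for line in describe(sample_payload):
    print(line)
```

In practice you would pass the URL copied from the "Headers" tab to `fetch_json` and feed the result to `describe`.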
What about the data generated on the browser or server side?
If you don't find any trace of an API after completing the previous step, you need to determine which of the two remaining categories the website belongs to. This doesn't have to be very complicated. It's very easy to distinguish between sites that load their data on the server side and those that load it on the browser side.
On the page you are analysing, look at the source code. In Firefox or Chrome, right-click on the page and choose "View page source", which opens the source in a new tab. If you find the information that interests you there, it means that this data has not been generated by the browser. It was already present in the initial code received, so it was generated on the server side.
In the new tab containing the source, you can perform a text search for any of the information you are interested in. If the search finds it, the data is generated on the server side. If it does not, the data is loaded on the browser side.
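The same text search can be done from a script. The sketch below downloads the raw HTML without running any JavaScript, which is roughly what the page source view shows, and checks whether a given piece of text is present. The function names are made up for illustration, and some sites may refuse requests that do not come from a browser.

```python
import html
from urllib.request import Request, urlopen


def fetch_raw_html(url: str, timeout: float = 10.0) -> str:
    """Download the initial HTML of a page, without executing JavaScript."""
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def data_location(raw_html: str, needle: str) -> str:
    """Return "server" if `needle` is in the initial HTML, else "browser"."""
    # Decode entities such as &amp; so they do not hide a match.
    text = html.unescape(raw_html).lower()
    return "server" if needle.lower() in text else "browser"


# Hand-written snippets standing in for downloaded pages.
print(data_location("<h2>Studio, Paris 11e</h2>", "Paris 11e"))
print(data_location('<div id="app"></div>', "Paris 11e"))
```

Choose a search text that is specific to one listing, such as an exact price or title, so that a match is unlikely to come from menus or footers.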
Possible avenues
It is now possible to classify the site you have analysed. Knowing what type of site it is matters because the most appropriate scraping approach differs from one type to another.
For websites with an API, where the API is not protected by a security system, it is possible to retrieve data without having to code. You can do this via the browser, the Postman software or easy-to-use online tools.
For websites that don't have an API and that load their data server-side, you can in some cases retrieve data without coding by using the Google Chrome "Web Scraper" extension. All you need to do is follow the right steps.
The same applies to websites that do not have an API and load their data on the browser side (front-end). It is possible to scrape data without having to code using the Google Chrome "Web Scraper" extension in certain cases and to a certain extent.
For people with web development experience, or who have mastered a web scraping tool, this diagnosis will tell them how to go about the job.
If you are used to contacting service providers for web scraping assignments, this analysis will give you a better understanding of the work they will be doing. This will enable you to better understand their explanations during your discussions.
Scraping page by page
Would you like to collect all the rental ads for flats in Paris and compare their prices by surface area, arrondissement, agency or number of bedrooms? You can do this by accessing the SeLoger site and manually copying page by page the details of each offer. This can take hours, as there are thousands of listings to go through.
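To make the kind of comparison described above concrete, here is a small sketch that computes the average rent per square metre for each arrondissement. The listings, the field names (`arrondissement`, `rent_eur`, `surface_m2`) and the function name are invented for illustration. They are not real SeLoger data or fields.

```python
from collections import defaultdict
from statistics import mean


def rent_per_m2_by_arrondissement(listings: list[dict]) -> dict[str, float]:
    """Average rent per square metre, grouped by arrondissement."""
    grouped = defaultdict(list)
    for ad in listings:
        # Skip ads with a missing or zero surface to avoid dividing by zero.
        if ad["surface_m2"] > 0:
            grouped[ad["arrondissement"]].append(ad["rent_eur"] / ad["surface_m2"])
    return {district: round(mean(values), 2) for district, values in sorted(grouped.items())}


# Invented sample listings, for demonstration only.
sample_listings = [
    {"arrondissement": "75011", "rent_eur": 1000, "surface_m2": 25},
    {"arrondissement": "75011", "rent_eur": 1500, "surface_m2": 50},
    {"arrondissement": "75015", "rent_eur": 1200, "surface_m2": 40},
]

print(rent_per_m2_by_arrondissement(sample_listings))
```

The same grouping idea applies to agency or number of bedrooms: only the key used for grouping changes.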
Fortunately, computers are good at performing repetitive tasks. To begin with, you need to create a list of pages to consult. On the site, the search results for flats in Paris are displayed on several pages. There are 20 ads on each page. The addresses of these pages are virtually identical: only the page number changes.
To obtain a list of web pages to consult, simply change the last digit indicating the page number. Once you have the list of web pages, you need to access each page. You can do this easily in Python.
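As a rough sketch of these two steps, the code below builds the list of page addresses from a template and then downloads each page in turn, using only the standard library. The template URL is a placeholder on example.com, not the real SeLoger address, and the function names and the one-second pause are choices made for this illustration.

```python
import time
from urllib.request import Request, urlopen

# Placeholder address: replace it with the search URL seen in your browser,
# keeping {page} where the page number appears.
URL_TEMPLATE = "https://www.example.com/search?city=paris&page={page}"


def build_page_urls(template: str, first: int, last: int) -> list[str]:
    """Return one URL per results page, from `first` to `last` inclusive."""
    return [template.format(page=number) for number in range(first, last + 1)]


def fetch_pages(urls: list[str], delay_seconds: float = 1.0) -> dict[str, str]:
    """Download each URL in turn and return a mapping of URL to HTML."""
    pages = {}
    for url in urls:
        request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
        with urlopen(request, timeout=10) as response:
            pages[url] = response.read().decode("utf-8", errors="replace")
        # Pause between requests so the site is not flooded.
        time.sleep(delay_seconds)
    return pages


urls = build_page_urls(URL_TEMPLATE, 1, 3)
print(urls)
```

Calling `fetch_pages(urls)` would then download the pages one by one. It is left out of the example run because it needs a real address and a network connection.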
All these steps will enable you to collect information on the SeLoger.com website. All you have to do is follow them carefully, and you're in business.