Python is often the first language that comes to mind when we talk about scraping data from websites. Its powerful libraries and easy syntax have made it a go-to choice for many. But what if I told you there’s a whole world of web scraping beyond Python?
In this article, we’ll explore alternative methods for scraping websites that don’t rely on Python. You might be surprised to learn that you don’t always need to write Python code to gather data from the web. Whether you’re new to coding or a seasoned pro, we’ll walk you through tools and techniques that make web scraping accessible to everyone.
First, let’s revisit the basics. At its core, web scraping is the process of extracting data from websites or web applications. Developers and data enthusiasts use this technique to gather information for analysis, research, or automation.
To showcase the versatility of web scraping, we’ll demonstrate how to extract data using various programming languages. For this article, we’ll use Scrape IT as our example website.
Our task is straightforward: we’ll fetch the HTML content of the Scrape IT website and extract the text within the <title> tag. It’s a simple yet powerful example that highlights the accessibility and practicality of web scraping.
So our goal is to get the text “Scrape IT – Wij scrapen data voor jou” (Dutch for “We scrape data for you”) from the website.
To get the text we want from the website, we’ll do two things:
- Get the website code: First, we’ll grab the website’s code. It’s like getting a book to find the information we need.
- Find the title: Next, we’ll look through the code to find the title. It’s like searching for a specific word in a book.
Alright, let’s kick things off with a language that holds a special place in many developers’ hearts – C. If you’re like me, C was probably one of the first languages you learned, and it still has that nostalgic charm.
Web scraping using C
Code:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAX_HTML_SIZE 100000 // Maximum size of HTML content to store

int main() {
    char html[MAX_HTML_SIZE];      // Buffer to store the HTML content
    FILE *curl_output;             // File pointer to capture curl output
    char *title_start, *title_end; // Pointers to start and end of <title> tag
    size_t bytes_read;             // Number of bytes actually read

    // Run curl (silently) and capture its output
    curl_output = popen("curl -s https://scrape-it.nl/", "r");
    if (curl_output == NULL) {
        printf("Failed to run curl command.\n");
        return 1;
    }

    // Read the output of curl into the html buffer, leaving room for '\0'
    bytes_read = fread(html, sizeof(char), MAX_HTML_SIZE - 1, curl_output);
    html[bytes_read] = '\0'; // Null-terminate so strstr() can search safely

    // Close the file pointer
    pclose(curl_output);

    // Find the start of the first <title> tag
    title_start = strstr(html, "<title>");
    if (title_start == NULL) {
        printf("No <title> tag found.\n");
        return 1;
    }

    // Move the pointer past "<title>" to the start of the tag's content
    title_start += strlen("<title>");

    // Find the matching </title> tag
    title_end = strstr(title_start, "</title>");
    if (title_end == NULL) {
        printf("Invalid <title> tag.\n");
        return 1;
    }

    // Null-terminate the content within the <title> tags
    *title_end = '\0';

    // Print the content within the first <title> tag
    printf("Content within <title> tag: %s\n", title_start);
    return 0;
}
This code fetches the title of the Scrape IT website. It uses a command-line tool called curl to download the HTML content of the site, then searches for the <title> tag within that HTML and prints its content.
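If you want to try it yourself, a typical build and run looks like this, assuming gcc on a Linux or macOS system (popen() and pclose() are POSIX functions) and assuming the source is saved as scrape.c, a filename chosen here purely for illustration:
gcc scrape.c -o scrape
./scrape
On Windows you would need an environment such as WSL or MinGW, or the _popen()/_pclose() equivalents.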
Output:
Content within <title> tag: Scrape IT – Wij scrapen data voor jou
Web scraping using C#
Code:
using System;
using HtmlAgilityPack;

namespace ScrapeItScrapingCSharp
{
    internal class Program
    {
        static void Main(string[] args)
        {
            // Create HtmlWeb instance
            HtmlWeb web = new HtmlWeb();

            // Load website
            HtmlDocument doc = web.Load("https://scrape-it.nl/");

            // Get title node
            HtmlNode titleNode = doc.DocumentNode.SelectSingleNode("//title");

            // Check if title node exists
            if (titleNode != null)
            {
                // Print title text
                Console.WriteLine("Content within <title> tag: " + titleNode.InnerText);
            }
            else
            {
                // Print error message if title node is not found
                Console.WriteLine("No <title> tag found.");
            }
        }
    }
}
This C# code uses the HtmlAgilityPack library to fetch and parse the HTML content of the website. With its XPath support, we can target the <title> element directly and extract its text, which keeps the HTML parsing short and makes it effortless to fetch specific elements from a page.
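HtmlAgilityPack is a NuGet package, so it has to be added to the project before the code will compile. Assuming the .NET SDK is installed, a minimal command-line setup could look like this (the project name is just an illustration):
dotnet new console -n ScrapeItScrapingCSharp
cd ScrapeItScrapingCSharp
dotnet add package HtmlAgilityPack
dotnet run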
Output:
Content within <title> tag: Scrape IT – Wij scrapen data voor jou
Web scraping using Java
Code:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import java.io.IOException;

public class Main {
    public static void main(String[] args) {
        // URL of the website to scrape
        String url = "https://scrape-it.nl/";

        try {
            // Connect to the website and get the HTML document
            Document doc = Jsoup.connect(url).get();

            // Get the title element
            Element titleElement = doc.select("title").first();

            // Check if the title element exists
            if (titleElement != null) {
                // Print the title text
                System.out.println("Content within <title> tag: " + titleElement.text());
            } else {
                // Print error message if title element is not found
                System.out.println("No <title> tag found.");
            }
        } catch (IOException e) {
            // Print error message if connection fails
            System.out.println("Failed to fetch HTML content: " + e.getMessage());
        }
    }
}
This Java code uses the Jsoup library to fetch the HTML content of the website. Jsoup handles the HTML parsing and navigation, letting us target the <title> element with CSS selector syntax and read its text to obtain the title of the website.
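Jsoup is a third-party library, so it needs to be on the classpath. With Maven, for example, the dependency entry looks like this (the version shown is simply a recent release at the time of writing, so check for the latest):
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.17.2</version>
</dependency>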
Output:
Content within <title> tag: Scrape IT – Wij scrapen data voor jou
Web scraping using JavaScript
Code:
// URL of the website to scrape
const url = 'https://scrape-it.nl/';

// Fetch HTML content
fetch(url)
    .then(response => response.text())
    .then(html => {
        // Parse HTML content
        const parser = new DOMParser();
        const doc = parser.parseFromString(html, 'text/html');

        // Get the title element
        const titleElement = doc.querySelector('title');

        // Check if the title element exists
        if (titleElement) {
            // Print the title text
            console.log(`Content within <title> tag: ${titleElement.textContent}`);
        } else {
            // Print error message if title element is not found
            console.log('No <title> tag found.');
        }
    })
    .catch(error => {
        // Print error message if fetching fails
        console.error(`Failed to fetch HTML content: ${error}`);
    });
This JavaScript code fetches the HTML content of the website using the native fetch API. By leveraging the DOMParser interface, we parse the HTML and navigate the resulting document to target the <title> element, then extract its text to obtain the title of the website.
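One caveat worth knowing: fetch and DOMParser as used here are browser APIs, so the easiest way to run this snippet is to paste it into the browser’s developer-tools console. Because browsers enforce CORS on cross-origin requests, the most reliable place to try it is the console of a tab that already has https://scrape-it.nl/ open; from an unrelated page the request may be blocked.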
Output:
Content within <title> tag: Scrape IT – Wij scrapen data voor jou
Web scraping using Node.js
Code:
const axios = require('axios');
const cheerio = require('cheerio');

// URL of the website to scrape
const url = 'https://scrape-it.nl/';

// Fetch HTML content
axios.get(url)
    .then(response => {
        // Load HTML content into cheerio
        const $ = cheerio.load(response.data);

        // Get the title element
        const titleElement = $('title');

        // Check if the title element exists (a cheerio selection is always
        // truthy, so we check its length instead)
        if (titleElement.length > 0) {
            // Print the title text
            console.log(`Content within <title> tag: ${titleElement.text()}`);
        } else {
            // Print error message if title element is not found
            console.log('No <title> tag found.');
        }
    })
    .catch(error => {
        // Print error message if fetching fails
        console.error(`Failed to fetch HTML content: ${error}`);
    });
This Node.js code fetches the HTML content of the website using the axios library, a popular HTTP client for Node.js. We then load the HTML into cheerio, which parses it into a DOM-like structure we can traverse with jQuery-style syntax. By targeting the <title> element, we extract its text to retrieve the title of the website.
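Both axios and cheerio come from npm, so the script needs a small project around it. Assuming Node.js and npm are installed, and assuming the script is saved as scrape.js (a filename chosen for illustration), the setup is:
npm init -y
npm install axios cheerio
node scrape.js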
Output:
Content within <title> tag: Scrape IT – Wij scrapen data voor jou
What if we wanted to perform web scraping using the first programming language ever created?
I asked Google what the first programming language was, and its answer was Fortran. (Released in 1957, Fortran is generally considered the first widely used high-level language.)
Web scraping using Fortran
Code:
PROGRAM ReadFile
    CHARACTER(1000) :: line  ! Line buffer; lines longer than this are truncated
    INTEGER :: title_start, title_end
    CHARACTER(1000) :: title

    ! Fetch the page and save it to a local file
    ! (CALL SYSTEM is a common compiler extension; gfortran supports it)
    CALL SYSTEM('curl -s https://scrape-it.nl/ > html_content.txt')

    ! Open the input file
    OPEN(UNIT=10, FILE='html_content.txt', STATUS='OLD', ACTION='READ')

    ! Read each line of the file
    DO
        READ(10, '(A)', END=20) line

        ! Check if the line contains the <title> tag
        title_start = INDEX(line, '<title>')
        IF (title_start > 0) THEN
            ! Find the closing tag and extract the text between the tags
            title_end = INDEX(line(title_start:), '</title>') + title_start - 1
            title = line(title_start + LEN('<title>'):title_end - 1)
            PRINT *, 'Title: ', TRIM(title)
        END IF
    END DO
20  CONTINUE

    ! Close the input file
    CLOSE(10)

    ! Prompt for user input to prevent immediate exit
    PRINT *, 'Press Enter to exit...'
    READ(*, *)
END PROGRAM ReadFile
This Fortran code fetches the HTML content of the website with the curl command, saving it to html_content.txt, then opens that file and reads it line by line, searching for the <title> tag. If found, it extracts the text between <title> and </title> and prints it.
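To build and run it, assuming gfortran is installed and curl is available on the system path (the program shells out to curl via CALL SYSTEM), and assuming the source is saved as scrape.f90, a filename chosen for illustration:
gfortran scrape.f90 -o scrape
./scrape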
Output:
Title: Scrape IT – Wij scrapen data voor jou
Concluding our exploration, we’ve covered the essentials of web scraping in this article. Think of it like choosing tools for a project—whether you prefer Python, C#, Java, or even Fortran, it’s about what suits your style. And hey, I’m not against Python—it’s still fun to code with Python too! But remember, web scraping isn’t dependent on any specific language. So, pick your favorite, dive in, and start uncovering the treasures hidden within the web!