How to Scrape News Content from Popular News Sites?


Deep knowledge of a specific business domain is essential for any business that wants to keep pace with its competitors, and news is one of the best ways to learn what's happening worldwide. For data engineers in particular, news articles are a rich source of data, and more data means more insights. But collecting and reading enough news to stay informed is challenging and time-consuming, and doing it manually at scale is practically impossible. Scraping news sites solves this problem: it delivers essential business updates quickly and at scale.

This article explains everything you need to know about news scraping and how to scrape news content quickly and effectively.

What is News Scraping?

News scraping is the automated extraction of news content, press releases, and updates from publicly available news websites. Because these sites contain a wealth of meaningful public data, reviews of newly launched products, and key business announcements, scraping them can contribute directly to any business's success.

Benefits of Scraping News Sites

News aggregation helps you collect important content that attracts and grows your target audience and turns your platform into a go-to news outlet. Rather than competing with other brands and sites, a news aggregator gives them additional exposure.

There are several benefits of scraping news. A few of them are listed below:

  • Provides updated information about businesses and more.
  • Boosts compliance and operations.
  • Extracts verified and authentic news.
  • Helps identify risks and mitigation strategies.
  • Provides information about important business announcements.

News scraping services aggregate the most relevant news content from across the web. They save users the hassle of hunting for articles, reports, interviews, and more by bringing everything together in one place.

Few Considerations for Scraping Different Types of News Websites

Before you scrape news content from popular news sites, keep in mind the following considerations:

  • Choose Your Niche: Although a news aggregator lets you collect news on a vast range of topics, it is best to stay ahead by picking a niche. Research which topics get the most clicks; this will keep your platform fresh.
  • Use Only Trustworthy Sources: Collect data from credible sources and double-check your facts. Verify all your links, and make sure all the news on your site is current and relevant.
  • Choose How to Present the Information on Your Site: Decide how your audience will see the content: you can provide the entire article, or show a glimpse of it before redirecting readers to the source.

List of Data Fields


At iWeb Data Scraping, we provide news website data scraping services for several sites, including Yahoo News, MSN, and more. The data fields we typically extract include:

  • News Category and Sub-Category
  • Published Date
  • Published Time
  • News Author
  • News Title
  • News Body

Why Is Using a News Scraping API the Best Option?

A dedicated news scraping API takes the operational burden of scraping off your team, typically offering:

  • Proxy rotation
  • Proxy management services
  • Specialized modules
  • CAPTCHA solving
  • Structured and organized results

In the rest of this article, we will build a news scraper that extracts the latest news articles from various newspapers and stores them as text. We will work through the following two steps:

  • Surface-level intro to webpages and HTML
  • Scraping news articles using Python and BeautifulSoup

Surface-level intro to webpages and HTML

When we go to any specific URL using a web browser, the webpage we see is a combination of three technologies:

  • HTML: It defines the webpage's content. HTML is the standard markup language for adding content to a website.
  • CSS: It styles the webpage.
  • JavaScript: It handles the logic and interactive functionality of the webpage.

Scraping News Articles Using Python

Python offers several packages that help with scraping information from a webpage. Here, we will use BeautifulSoup for web scraping.

Install the library packages using the following command.

! pip install beautifulsoup4

We will also use the requests module to fetch the page's HTML code for BeautifulSoup to parse. Install it using the following command.

! pip install requests


We will also use urllib, Python's built-in URL handling module; it helps in fetching URLs. It ships with the standard library, so there is no need to install it separately.
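For completeness, a minimal sketch of fetching a page with urllib (we will rely on requests for the rest of this article; the URL here is a placeholder):

import urllib.request
html = urllib.request.urlopen('https://www.politifact.com').read()  # placeholder URL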

Importing the Libraries

Now, we will import all the necessary libraries.

Import BeautifulSoup in your IDE using the following command.

from bs4 import BeautifulSoup

This library helps get the HTML structure of the desired page and provides functions to access specific elements and extract relevant information.

Now, import urllib using the following command.

import urllib.request, sys, time

To import requests, type the following:

import requests

This module sends HTTP requests to a web server from Python.

Import pandas using the following.

import pandas as pd

We will use this library to build a DataFrame.


Now, make a simple GET request to fetch a page, as sketched below.

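A minimal sketch, assuming PolitiFact's fact-check listing page as the target; substitute the news page you want to scrape:

url = 'https://www.politifact.com/factchecks/list/'  # assumed target URL
page = requests.get(url)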

We will wrap the requests.get(url) call in a try-except block so that a failed request does not crash the scraper.


We will also use a ‘for’ loop for pagination, as sketched below.
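Here is a sketch combining both ideas, assuming the listing pages accept a ?page= query parameter (verify this URL scheme against your target site):

for page_number in range(1, 6):  # first five listing pages; the URL scheme is an assumption
    url = 'https://www.politifact.com/factchecks/list/?page=' + str(page_number)
    try:
        page = requests.get(url, timeout=10)
    except requests.exceptions.RequestException as e:
        # covers connection errors, timeouts, and other request failures
        print('Request failed for', url, ':', e)
        continue
    # parse page.text here, as shown in the sections that follow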

Inspecting the Response Object

See the response code that the server sent back.

page.status_code

Output

Status of Response object 
200

The HTTP 200 OK status response code shows that the request has succeeded.

Now, access the complete response as text.

page.text

Output


It will return the HTML content of a response object in Unicode.

As a test, look for a specific substring within the response.

if "Politifact" in page.text: 
print("Yes, Scarpe it")

Check for the response’s content type.

print(page.headers.get("content-type", "unknown"))

Output

response's Content Type 
text/html; charset=utf-8

Delaying Requests

We will call the time.sleep() function with a value of 2 seconds to pause between successive requests.

time.sleep(2)

Extracting Content from HTML

It’s time to parse HTML content to extract the desired value.

(a) Using Regular Expressions

import re  # put this at the top of the file
print(re.findall(r'\$[0-9,.]+', page.text))

Output

['$294', '$9', '$5.8']

(b) Using BeautifulSoup

soup = BeautifulSoup(page.text, "html.parser")

The command below looks for all <li> tags with the class ‘o-listicle__item’ (a class specific to PolitiFact's listing markup).

links = soup.find_all('li', attrs={'class':'o-listicle__item'})

Inspecting the Webpage

To understand the above code, inspect the webpage in your browser's developer tools.

Since we need the news section of the page, right-click that article section and choose the Inspect Element option. The browser will highlight that section of the web page along with its HTML source.

We will continue with our code.

print(len(links))

This command prints the number of news articles found on the given page.

Finding Elements and Attributes

Look for all anchor tags on the page.

links = soup.find_all("a")

The following finds a division tag under each <li>. Here, ‘j’ is the variable iterating over the list items found above.

Statement = j.find("div", attrs={'class':'m-statement__quote'})

The .text.strip() call returns the text within this tag, stripping any extra whitespace.

Statement = j.find("div", attrs={'class':'m-statement__quote'}).text.strip()

We have scraped our first attribute. In the same division, we will look for the anchor tag and return the value of its hypertext reference (the href).

Link = j.find("div", attrs={'class':'m-statement__quote'}).find('a')['href'].strip()

To get the Date attribute, we will inspect the web page first.

Date = j.find('div', attrs={'class':'m-statement__body'}).find('footer').text[-14:-1].strip()
Source = j.find('div', attrs={'class':'m-statement__author'}).find('a').get('title').strip()

Next, we use ‘alt’ as the attribute passed to get(), since the article's rating label is stored in an image's alt text; see the sketch below.
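A hedged sketch of the Label extraction; the ‘m-statement__meter’ class name is an assumption about PolitiFact's markup, so confirm it with Inspect Element first:

# 'm-statement__meter' is an assumed class name; verify via Inspect Element
Label = j.find('div', attrs={'class':'m-statement__meter'}).find('img').get('alt').strip()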

Let’s combine all of these concepts and fetch the details for the five attributes of our dataset: Statement, Link, Date, Source, and Label.

Making the Dataset

frame.append([Statement, Link, Date, Source, Label])
upper_frame.extend(frame)

Here, frame holds the rows scraped from the current page, and upper_frame accumulates rows across all pages.

Visualising Dataset

For visualizing, use a pandas DataFrame, as sketched below.

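A minimal sketch, assuming the column order used when appending rows above:

data = pd.DataFrame(upper_frame, columns=['Statement', 'Link', 'Date', 'Source', 'Label'])
print(data.head())  # preview the first five rows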

Make a CSV File & Save It to Your Machine

Finally, write the data to a CSV file and save it to your machine, as sketched below.

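A one-line sketch with pandas; the file name here is arbitrary:

data.to_csv('politifact_news.csv', index=False)  # index=False drops pandas' row numbers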

Complete Code
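Below is a sketch that stitches the snippets above into one script. The URL scheme, the number of pages, and the rating-image class name ('m-statement__meter') are assumptions; verify each against the live site before running.

from bs4 import BeautifulSoup
import requests
import pandas as pd
import time

upper_frame = []  # accumulates rows across all pages

for page_number in range(1, 6):  # assumed: scrape the first five listing pages
    url = 'https://www.politifact.com/factchecks/list/?page=' + str(page_number)
    try:
        page = requests.get(url, timeout=10)
    except requests.exceptions.RequestException as e:
        print('Request failed for', url, ':', e)
        continue

    time.sleep(2)  # pause between requests to avoid overloading the server

    soup = BeautifulSoup(page.text, 'html.parser')
    links = soup.find_all('li', attrs={'class':'o-listicle__item'})
    print(len(links), 'articles found on page', page_number)

    frame = []
    for j in links:
        Statement = j.find('div', attrs={'class':'m-statement__quote'}).text.strip()
        Link = j.find('div', attrs={'class':'m-statement__quote'}).find('a')['href'].strip()
        Date = j.find('div', attrs={'class':'m-statement__body'}).find('footer').text[-14:-1].strip()
        Source = j.find('div', attrs={'class':'m-statement__author'}).find('a').get('title').strip()
        # 'm-statement__meter' is an assumed class name for the rating image
        Label = j.find('div', attrs={'class':'m-statement__meter'}).find('img').get('alt').strip()
        frame.append([Statement, Link, Date, Source, Label])
    upper_frame.extend(frame)

data = pd.DataFrame(upper_frame, columns=['Statement', 'Link', 'Date', 'Source', 'Label'])
data.to_csv('politifact_news.csv', index=False)  # file name is an assumption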

For more information, get in touch with iWeb Data Scraping now! You can also reach us for all your web scraping service and mobile app data scraping requirements.

Let’s Discuss Your Project