HTTPS Preventing Website Scraping in Python 3

December 22, 2023

I am trying to scrape a website with Python, following a tutorial. However, the website has since been secured with HTTPS, and running the code now returns the error below.

Solution 1:

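Can you try adding this to your code? It should bypass SSL certificate verification. Note that it replaces the default HTTPS context for the whole process and relies on private functions of the ssl module, so it is best kept to testing.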

import ssl
ssl._create_default_https_context = ssl._create_unverified_context
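If changing the default for the whole process is more than you want, a narrower variant is to build an unverified context and pass it to a single request. The following is a minimal sketch using only the standard library; the helper names make_unverified_context and fetch are made up for illustration and are not part of the original answer. Like the snippet above, it turns certificate checks off, so it is only suitable for testing.

```python
import ssl
import urllib.request


def make_unverified_context() -> ssl.SSLContext:
    """Return an SSL context that skips certificate checks (testing only)."""
    context = ssl.create_default_context()
    # check_hostname has to be disabled before verify_mode can be CERT_NONE.
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE
    return context


def fetch(url: str) -> bytes:
    """Fetch url with verification disabled for this one request only."""
    context = make_unverified_context()
    with urllib.request.urlopen(url, context=context, timeout=10) as response:
        return response.read()
```

Calling fetch('https://...') leaves every other HTTPS request in the program on the default, verified context.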

Solution 2:

The problem here is that the URL has anti-scraping protections in place, which resist programmatic HTML extraction.

Try requests and print both the response headers and the parsed body to see what the server actually returns:

import requests
from bs4 import BeautifulSoup

# Specify the URL
quote_page = 'https://www.bloomberg.com/quote/SPX:IND'
result = requests.get(quote_page)
# Print the response headers
print(result.headers)
# Parse the HTML with Beautiful Soup and store it in the variable `soup`
c = result.content
soup = BeautifulSoup(c, "lxml")

print(soup)

Output

{'Cache-Control': 'private, no-store, no-cache, must-revalidate, proxy-revalidate, max-age=0', 'Content-Encoding': 'gzip', 'Content-Type': 'text/html, text/html; charset=utf-8', 'ETag': 'W/"5bae6ca0-97f"', 'Last-Modified': 'Fri, 28 Sep 2018 18:02:08 GMT', 'Server': 'nginx', 'Accept-Ranges': 'bytes, bytes', 'Age': '0, 0', 'Content-Length': '1174', 'Date': 'Sat, 29 Sep 2018 17:03:02 GMT', 'Via': '1.1 varnish', 'Connection': 'keep-alive', 'X-Served-By': 'cache-fra19128-FRA', 'X-Cache': 'MISS', 'X-Cache-Hits': '0', 'X-Timer': 'S1538240583.834133,VS0,VE107', 'Vary': ', Accept-Encoding'}
<html>
<head><title>Terms of Service Violation</title><style type="text/css">.container {
            font-family: Helvetica, Arial, sans-serif;
        }
    </style><script>window._pxAppId = "PX8FCGYgk4";
        window._pxJsClientSrc = "/8FCGYgk4/init.js";
        window._pxFirstPartyEnabled = true;
        window._pxHostUrl = "/8FCGYgk4/xhr";
        window._pxreCaptchaTheme = "light";

        function qs(name) {
            var search = window.location.search;
            var rx = new RegExp("[?&]" + name + "(=([^&#]*)|&|#|$)");
            var match = rx.exec(search);
            return match ? decodeURIComponent(match[2].replace(/\+/g, " ")) : null;
        }
    </script></head><body><div class="container"><img src="https://www.bloomberg.com/graphics/assets/img/BB-Logo-2line.svg" style="margin-bottom: 40px;" width="310"/><h1 class="text-center" style="margin: 0 auto;">Terms of Service Violation</h1><p>Your usage has been flagged as a violation of our <a href="http://www.bloomberg.com/tos" target="_blank">terms of service</a>.
    </p><p>
        For inquiries related to this message please <a href="http://www.bloomberg.com/feedback">contact support</a>.
        For sales
        inquiries, please visit <a href="http://www.bloomberg.com/professional/request-demo">http://www.bloomberg.com/professional/request-demo</a></p><h3 style="margin: 0 auto;">
        If you believe this to be in error, please confirm below that you are not a robot by clicking "I'm not a robot"
        below.</h3><br/><div id="px-captcha" style="width: 310px"></div><br/><h3 style="margin: 0 auto;">Please make sure your browser supports JavaScript and cookies and
        that you are not blocking them from loading. For more information you can review the Terms of Service and Cookie
        Policy.</h3><br/><h3 id="block_uuid" style="margin: 0 auto; color: #C00;">Block reference ID: </h3><script src="/8FCGYgk4/captcha/captcha.js?a=c&amp;m=0"></script><script type="text/javascript">document.getElementById("block_uuid").innerText = "Block reference ID: " + qs("uuid");</script></div></body>
</html>
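Since the server answers with a block page rather than the quote data, a script can check for that case before trying to parse anything. Here is a rough sketch, assuming the two marker strings that appear in the output above keep identifying the block page; the names BLOCK_MARKERS and looks_blocked are made up for illustration.

```python
# Marker strings copied from the block page printed above; the site may
# change them at any time, so treat this list as an assumption.
BLOCK_MARKERS = ("Terms of Service Violation", "px-captcha")


def looks_blocked(html: str) -> bool:
    """Return True if the page text contains any known block-page marker."""
    return any(marker in html for marker in BLOCK_MARKERS)


if __name__ == "__main__":
    sample = "<html><head><title>Terms of Service Violation</title></head></html>"
    print(looks_blocked(sample))
```

With the requests code above, this would be called as looks_blocked(result.text) before handing the content to BeautifulSoup.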

By the way, if you are a student, you can sign up for an account that is limited in terms of downloads.
