Prj701 4 ~ XML

August 29, 2018
PRJ701 Python XML Open source

Project 701 ~ XML (Extensible Markup Language)

I’m personally not a fan of XML, as I feel it can get too complicated and cluttered, although I admit I’m biased towards JSON. Even so, this is what I have to learn, so learn I shall. XML to me is just HTML but with more tags, so picking it up shouldn’t be too difficult. For anything new I normally look for documentation or blogs with examples. After a quick DuckDuckGo search I found a Stack Overflow answer with a code snippet, which should be sufficient.

Let’s make a quick example to solidify what we have just learned. For this example we are going to need an XML file to parse, and the first thing that came to mind was an RSS feed. This is only because I learned about RSS feeds in my first year in ITC501 (IT in context?) when we did a blog on Aaron Swartz, although I didn’t actually know how to interact with RSS feeds at the time because I didn’t know what XML was (and hadn’t learned HTML yet).

The next question is: what sites produce RSS feeds? To be honest I am not actually sure if there is a standard way of exposing RSS feeds, but I did happen to know that Reddit has them. After another quick search I found out that a feed is produced by simply adding .rss to the end of any Reddit URL. I am going to use /r/python, for obvious reasons, so the feed will look like this ~ https://reddit.com/r/python.rss. Click that link and you will understand the context of my first sentence.
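
The "just add .rss" trick can be sketched as a tiny helper. This is only an illustration: to_rss_url is a name I made up, and it assumes the suffix goes on the path part of the URL (before any query string), which is how the /r/python example above behaves.

```python
from urllib.parse import urlsplit, urlunsplit


def to_rss_url(reddit_url):
    """Hypothetical helper: append .rss to the path of a Reddit URL."""
    parts = urlsplit(reddit_url)
    # Drop a trailing slash so we get /r/python.rss rather than /r/python/.rss
    path = parts.path.rstrip('/')
    if not path.endswith('.rss'):
        path += '.rss'
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))


print(to_rss_url('https://reddit.com/r/python/'))
```

Running it on a URL that already ends in .rss should leave it unchanged.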

The example doesn’t need to be complicated in any way; all I want to do is parse (extract) the posts of the subreddit. We are going to rewrite (’cause you should NEVER copy and paste) the code snippet, which looks like this ~

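from lxml import etree
import requests

def main():
    url = 'https://www.reddit.com/r/python.rss'
    # Send our own User-Agent rather than the library default
    headerz = {'User-Agent': 'rss-ingestor'}
    # headers must be passed by keyword; the second positional argument is params
    response = requests.get(url, headers=headerz)
    # Stop early on an HTTP error instead of trying to parse an error page
    response.raise_for_status()
    # .content is bytes, which is what etree.fromstring wants for XML
    parsed_rss = etree.fromstring(response.content)
    # Walk the top-level elements of the feed, then each of their children
    for element in parsed_rss:
        print('Element: %s' % element)
        for subele in element:
            print('Subele: %s' % subele)
            if subele.text is not None:
                print(subele.text)

if __name__ == '__main__':
    main()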
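
The loop above prints every element it finds. To pull out just the posts, something like the following might work. It is a sketch under a few assumptions: that the feed is in Atom format (the namespace below is the standard Atom one), that the sample XML is something I made up rather than real Reddit output, and that I have swapped in the standard library's xml.etree.ElementTree instead of lxml so it runs without a network call or extra installs. extract_posts is just a name I picked.

```python
import xml.etree.ElementTree as ET

# Made-up sample in Atom format; a real feed carries many more fields.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Python</title>
  <entry>
    <title>First post</title>
    <link href="https://example.com/1"/>
  </entry>
  <entry>
    <title>Second post</title>
    <link href="https://example.com/2"/>
  </entry>
</feed>"""

# Prefix-to-URI map so we can write 'atom:entry' instead of the full {uri}entry
ATOM = {'atom': 'http://www.w3.org/2005/Atom'}


def extract_posts(xml_text):
    """Return a list of (title, link) tuples, one per <entry> in the feed."""
    root = ET.fromstring(xml_text)
    posts = []
    for entry in root.findall('atom:entry', ATOM):
        title = entry.findtext('atom:title', default='', namespaces=ATOM)
        link = entry.find('atom:link', ATOM)
        href = link.get('href') if link is not None else None
        posts.append((title, href))
    return posts


for title, href in extract_posts(SAMPLE_FEED):
    print(title, href)
```

The namespace map is the part that tripped me up at first: without it, a plain search for 'entry' finds nothing, because every tag in the feed is qualified by the Atom namespace.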

That’s pretty much all I am going to do at this point. Now I have a simple example that I can refer back to when I work more with XML and etree.
