Project 701 ~ XML (Extensible Markup Language)
I’m personally not a fan of XML as I feel it can get too complicated and cluttered, although I do come with a bias towards JSON. Even so, this is what I have to learn, so learn I shall. XML to me is just HTML but with more tags, so picking it up shouldn’t be too difficult. For anything new I normally look for documentation or blogs with examples. After a quick ddg search I found a Stack Overflow answer with a code snippet, which should be sufficient.

Let’s make a quick example to solidify what we have just learned. For this example we are going to need an XML file to parse, and the first thing that came to mind for me was an RSS feed. This is only because I had learned about RSS feeds in my first year in ITC501 (IT in Context?) when we did a blog on Aaron Swartz, although I didn’t actually know how to interact with RSS feeds at the time because I didn’t know what XML was (and hadn’t learned HTML yet).

The next question is which sites produce RSS feeds. To be honest I am not actually sure if there is a standard way of exposing RSS feeds, but I did happen to know that Reddit has them. After a quick ddg search I found out that you get the feed by simply adding .rss to the end of any Reddit URL. I am going to use /r/python, for obvious reasons, so the feed URL looks like this ~ https://reddit.com/r/python.rss. Click that link and you will understand the context of my first sentence.
The example doesn’t need to be complicated in any way; all I want to do is parse (extract) the posts of the subreddit. We are going to rewrite (because you should NEVER copy and paste) the code snippet, which looks like this ~
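from lxml import etree
import requests


def main():
    url = 'https://www.reddit.com/r/python.rss'
    # Send a custom User-Agent along with the request.
    headerz = {'User-Agent': 'rss-ingestor'}
    # headers must be passed by keyword: the second positional
    # argument of requests.get() is params, not headers.
    raw_rss_feed = requests.get(url, headers=headerz).content
    parsed_rss = etree.fromstring(raw_rss_feed)
    # Walk the children of the root, then each child's own children.
    for element in parsed_rss:
        print('Element: %s' % element)
        for subele in element:
            print('Subele: %s' % subele)
            if subele.text is not None:
                print(subele.text)


if __name__ == '__main__':
    main()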
That’s pretty much all I am going to do at this point. Now I have a simple example that I can refer back to when I work more with XML and etree.
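As a follow-up for future me, here is a rough sketch of pulling out just the post titles and links instead of printing every element. It rests on a few assumptions that are mine, not from the snippet above: it uses the standard library's xml.etree.ElementTree rather than lxml (the parts used here behave much the same, and it runs without installing anything), it assumes the feed is Atom-style with `entry` elements in the Atom namespace, and both the `SAMPLE_FEED` string and the `extract_posts` helper are made up for illustration so the example works offline.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample standing in for a downloaded feed; a real feed will
# contain more elements and different values.
SAMPLE_FEED = """
<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Python</title>
  <entry>
    <title>First post</title>
    <link href="https://example.com/first"/>
  </entry>
  <entry>
    <title>Second post</title>
    <link href="https://example.com/second"/>
  </entry>
</feed>
"""

# Atom elements live in a namespace, so lookups need a prefix mapping.
ATOM_NS = {"atom": "http://www.w3.org/2005/Atom"}


def extract_posts(xml_text):
    """Return a list of (title, link) tuples, one per <entry> element."""
    root = ET.fromstring(xml_text)
    posts = []
    for entry in root.findall("atom:entry", ATOM_NS):
        title = entry.findtext("atom:title", default="", namespaces=ATOM_NS)
        link = entry.find("atom:link", ATOM_NS)
        # The link target is stored in the href attribute, not in the text.
        href = link.get("href", "") if link is not None else ""
        posts.append((title, href))
    return posts


if __name__ == "__main__":
    for title, href in extract_posts(SAMPLE_FEED):
        print(f"{title} ~ {href}")
```

The namespace mapping is the bit that would have tripped me up: without it, a plain search for `entry` finds nothing, because the element's full name includes the namespace URL.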