RSS reader in C#
What is RSS?
RSS (Rich Site Summary, also expanded as Really Simple Syndication) is an XML-based format for delivering regularly changing web content. Many news sites, weblogs and other online publishers syndicate their content as an RSS feed to whoever wants it.
Recently I had a requirement to read the RSS feeds of Blogger and a few other sites and load the posts into a database.
These are the steps I followed.
1. Create Domain Classes
    public class Post : Entity
    {
        public Post()
        {
        }

        public Post(long id)
        {
            Id = id;
        }

        public DateTime DatePublished { get; set; }
        public Blogger Blogger { get; set; }
        public string PostGUID { get; set; }
        public string Title { get; set; }
        public string PostURL { get; set; }
        public string Description { get; set; }
        public string ThumbnailURL { get; set; }
        public string PostCategories { get; set; }
        public decimal RankGiven { get; set; }
        public int RecordCount { get; set; }
    }
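The Post class depends on an Entity base class and a Blogger class that are not shown in this post. A minimal sketch of what they might look like, inferred only from the members the code in this post actually uses (Id, BlogURL and RSSFeedURL). Everything else about them, including the abstract modifier and the public setters, is an assumption:

```csharp
// Hypothetical base class: the real Entity is not shown in the post.
// Only an Id property is implied, by the Post(long id) constructor.
public abstract class Entity
{
    public long Id { get; set; }
}

// Hypothetical Blogger class: the parser only reads BlogURL and RSSFeedURL.
public class Blogger : Entity
{
    public string BlogURL { get; set; }
    public string RSSFeedURL { get; set; }
}
```

The real classes most likely carry more fields (a name, database mappings and so on), but these two properties are enough for the feed parser to compile.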
2. Create Console App
The console app first reads all the blogger URLs from the database and then passes them to the RSS feed parser.
The following code reads each RSS feed and loads its content into the Post class defined above.
    using System;
    using System.Collections.Generic;
    using System.Globalization;
    using System.Linq;
    using System.Text.RegularExpressions;
    using System.Xml.Linq;

    public class RSSFeedParser
    {
        // Reads each blogger's feed and maps every <item> element to a Post.
        public static List<Post> Parse(List<Blogger> bloggerList)
        {
            List<Post> posts = new List<Post>();
            // Namespace of the Media RSS extension, which defines <media:thumbnail>.
            XNamespace media = XNamespace.Get("http://search.yahoo.com/mrss/");
            foreach (Blogger blogger in bloggerList)
            {
                if (!string.IsNullOrEmpty(blogger.RSSFeedURL))
                {
                    Console.WriteLine("Started reading feeds for : {0}", blogger.BlogURL);
                    try
                    {
                        var rssFeed = XDocument.Load(blogger.RSSFeedURL);
                        foreach (var item in rssFeed.Descendants("item"))
                        {
                            // Casting to string returns null for a missing element,
                            // where .Value would throw and abort the whole feed.
                            Post post = new Post();
                            post.Blogger = blogger;
                            post.Title = (string)item.Element("title");
                            post.Description = GetTruncatedDescription((string)item.Element("description"));
                            // pubDate is written in English, so parse it culture-independently.
                            post.DatePublished = Convert.ToDateTime((string)item.Element("pubDate"), CultureInfo.InvariantCulture);
                            post.PostGUID = (string)item.Element("guid");
                            post.PostURL = (string)item.Element("link");
                            var thumbnail = item.Element(media + "thumbnail");
                            post.ThumbnailURL = thumbnail != null ? (string)thumbnail.Attribute("url") : null;
                            post.PostCategories = string.Join(",", item.Elements("category").Select(x => x.Value));
                            posts.Add(post);
                        }
                        Console.WriteLine("Successfully parsed posts for : {0}", blogger.BlogURL);
                    }
                    catch (Exception ex)
                    {
                        // A failure in one feed is logged and the remaining feeds are still processed.
                        Console.WriteLine("Error occurred while parsing posts for : {0} ({1})", blogger.BlogURL, ex.Message);
                    }
                }
            }
            return posts;
        }

        // Strips HTML tags and a few entities, then keeps at most the first 400 characters.
        private static string GetTruncatedDescription(string description)
        {
            if (string.IsNullOrEmpty(description))
            {
                return string.Empty;
            }
            string plainTextDescription = Regex.Replace(description, @"<[^>]*(>|$)|&nbsp;|&zwnj;|&raquo;|&laquo;", string.Empty).Trim();
            return plainTextDescription.Substring(0, Math.Min(400, plainTextDescription.Length));
        }
    }
An RSS feed is an XML document with a defined set of nodes. The full list of nodes defined by the RSS 2.0 specification is available at:
https://validator.w3.org/feed/docs/rss2.html
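To make the node structure concrete, here is a small sketch that parses a hand-written feed from a string instead of a URL. The feed content, the RssSample class and the DescribeFirstItem method are all made up for illustration. Only the element names (item, title, link, guid, pubDate, category, description) and the Media RSS namespace come from the specifications.

```csharp
using System;
using System.Globalization;
using System.Linq;
using System.Xml.Linq;

public static class RssSample
{
    // Made-up, minimal RSS 2.0 document; every URL and value is a placeholder.
    public const string Xml = @"<rss version=""2.0"" xmlns:media=""http://search.yahoo.com/mrss/"">
  <channel>
    <title>Example blog</title>
    <link>http://example.com/</link>
    <description>Sample feed</description>
    <item>
      <title>First post</title>
      <link>http://example.com/first-post</link>
      <guid>http://example.com/first-post</guid>
      <pubDate>Mon, 06 Jan 2014 10:30:00 GMT</pubDate>
      <category>csharp</category>
      <category>rss</category>
      <description>&lt;p&gt;Hello&lt;/p&gt;</description>
      <media:thumbnail url=""http://example.com/thumb.png"" />
    </item>
  </channel>
</rss>";

    // Builds "title | link | thumbnail | categories | year" for the first <item>.
    public static string DescribeFirstItem()
    {
        XNamespace media = "http://search.yahoo.com/mrss/";
        XElement item = XDocument.Parse(Xml).Descendants("item").First();

        // Casting an element or attribute to string yields null when the node
        // is missing, whereas reading .Value on a missing node throws.
        string title = (string)item.Element("title");
        string link = (string)item.Element("link");
        string thumbnail = (string)item.Element(media + "thumbnail")?.Attribute("url");
        string categories = string.Join(",", item.Elements("category").Select(c => c.Value));

        // pubDate uses the RFC 822 date format shown in the sample above.
        DateTimeOffset published = DateTimeOffset.Parse(
            (string)item.Element("pubDate"), CultureInfo.InvariantCulture);

        return $"{title} | {link} | {thumbnail} | {categories} | {published.Year}";
    }
}
```

Calling RssSample.DescribeFirstItem() from a console app's Main and printing the result is a quick way to check the parsing logic without network access.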
Here I read the post title, URL, thumbnail, categories and the other data I need for my app.
I used a separate method to parse the description field. It solves two issues.
1. The description field contains the full post content, which is too long, so I take a substring of it (the first 400 characters) to get the length my site requires.
2. In some RSS feeds the description field contains HTML, but I need plain text, so I use a regex to remove the HTML tags from the content.
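The two steps above can be tried on a sample string with a small standalone sketch. The DescriptionCleaner class, the ToPlainText method and the maxLength parameter are names invented for this example, and the entity list in the pattern (&nbsp;, &zwnj;, &raquo;, &laquo;) is an assumption about which entities are worth stripping.

```csharp
using System;
using System.Text.RegularExpressions;

public static class DescriptionCleaner
{
    // Matches HTML tags (including an unterminated tag at the end of the
    // input) and a handful of common HTML entities.
    private const string TagsAndEntities = @"<[^>]*(>|$)|&nbsp;|&zwnj;|&raquo;|&laquo;";

    // Removes tags and entities, then keeps at most maxLength characters.
    public static string ToPlainText(string html, int maxLength)
    {
        if (string.IsNullOrEmpty(html))
        {
            return string.Empty;
        }
        string text = Regex.Replace(html, TagsAndEntities, string.Empty).Trim();
        return text.Length > maxLength ? text.Substring(0, maxLength) : text;
    }
}
```

For example, DescriptionCleaner.ToPlainText("<p>Hello <b>world</b>&nbsp;</p>", 400) returns "Hello world", and the same input with a maxLength of 5 returns "Hello". A regex like this is adequate for building short summaries, but it is not a full HTML parser, and a plain Substring can cut the text in the middle of a word.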
Hope this helps.
Happy Coding