RSS Reader in C#

 

What is RSS?

RSS (Rich Site Summary, also often expanded as Really Simple Syndication) is an XML-based format for delivering regularly changing web content. Many news sites, weblogs and other online publishers syndicate their content as an RSS feed to anyone who wants it.

Recently I needed to read the RSS feeds of Blogger and a few other sites and load the posts into a database.

These are the steps I followed.

1. Create Domain Classes

    // Id is not declared here; it is assumed to be inherited from the Entity base class.
    public class Post : Entity
    {
        public Post()
        {
        }

        public Post(long id)
        {
            Id = id;
        }

        public DateTime DatePublished { get; set; }
        public Blogger Blogger { get; set; }
        public string PostGUID { get; set; }
        public string Title { get; set; }
        public string PostURL { get; set; }
        public string Description { get; set; }
        public string ThumbnailURL { get; set; }
        public string PostCategories { get; set; }
        public decimal RankGiven { get; set; }
        public int RecordCount { get; set; }
    }

2. Create Console App

The console app first reads all the blogger URLs from the database and then passes them to the RSS feed parser.

Below is the code I used to read each RSS feed and load its content into the Post class defined above.

    using System;
    using System.Collections.Generic;
    using System.Linq;
    using System.Text.RegularExpressions;
    using System.Xml.Linq;

    public class RSSFeedParser
    {
        // Maximum number of plain-text characters kept from each description.
        private const int MaxDescriptionLength = 400;

        public static List<Post> Parse(List<Blogger> bloggerList)
        {
            List<Post> posts = new List<Post>();

            // Media RSS namespace, used by feeds that expose a <media:thumbnail> element.
            XNamespace media = XNamespace.Get("http://search.yahoo.com/mrss/");

            foreach (Blogger blogger in bloggerList)
            {
                // Skip bloggers that have no feed URL.
                if (string.IsNullOrEmpty(blogger.RSSFeedURL))
                {
                    continue;
                }

                Console.WriteLine(string.Format("Started reading feeds for : {0}", blogger.BlogURL));
                try
                {
                    var rssFeed = XDocument.Load(blogger.RSSFeedURL);
                    foreach (var item in rssFeed.Descendants("item"))
                    {
                        var thumbnail = item.Element(media + "thumbnail");

                        // Casting an XElement or XAttribute to string yields null when it is
                        // missing, so optional RSS elements do not throw.
                        Post post = new Post();
                        post.Blogger = blogger;
                        post.Title = (string)item.Element("title");
                        post.Description = GetTruncatedDescription((string)item.Element("description"));
                        // Convert.ToDateTime returns DateTime.MinValue for a null string.
                        post.DatePublished = Convert.ToDateTime((string)item.Element("pubDate"));
                        post.PostGUID = (string)item.Element("guid");
                        post.PostURL = (string)item.Element("link");
                        post.ThumbnailURL = thumbnail != null ? (string)thumbnail.Attribute("url") : null;
                        post.PostCategories = string.Join(",", item.Elements("category").Select(x => x.Value));
                        posts.Add(post);
                    }

                    Console.WriteLine(string.Format("Successfully parsed posts for  : {0}", blogger.BlogURL));
                }
                catch (Exception ex)
                {
                    // A failure in one feed is logged and does not stop the remaining feeds.
                    Console.WriteLine(string.Format("Error occurred while parsing posts for  : {0} ({1})", blogger.BlogURL, ex.Message));
                }
            }

            return posts;
        }

        // Strips HTML tags and a few common entities, then keeps at most MaxDescriptionLength characters.
        private static string GetTruncatedDescription(string description)
        {
            if (string.IsNullOrEmpty(description))
            {
                return string.Empty;
            }

            string plainTextDescription = Regex.Replace(description, @"<[^>]*(>|$)|&nbsp;|&zwnj;|&raquo;|&laquo;", string.Empty).Trim();
            return plainTextDescription.Length > MaxDescriptionLength
                ? plainTextDescription.Substring(0, MaxDescriptionLength)
                : plainTextDescription;
        }
    }

An RSS feed is an XML document with a defined set of nodes. You can see the full list of nodes defined by the RSS 2.0 specification at the following site.

https://validator.w3.org/feed/docs/rss2.html

Here I read the post title, URL, thumbnail, categories and the other data I need for my app.
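
To make the node structure concrete, here is a minimal sketch that parses a small feed from a string instead of a URL. The feed content, the RssSample class and the DescribeItems method are all made up for illustration; only the element names (item, title, category, media:thumbnail) come from the parser above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class RssSample
{
    // Hypothetical feed content, invented for this example.
    public const string SampleFeed = @"<rss version=""2.0"" xmlns:media=""http://search.yahoo.com/mrss/"">
  <channel>
    <title>Example Blog</title>
    <link>http://example.com/</link>
    <description>A made-up feed</description>
    <item>
      <title>First post</title>
      <link>http://example.com/first-post</link>
      <guid>http://example.com/first-post</guid>
      <category>csharp</category>
      <category>rss</category>
      <media:thumbnail url=""http://example.com/thumb.png"" />
    </item>
  </channel>
</rss>";

    // Returns one "title | categories | thumbnail" line per <item> in the feed.
    public static List<string> DescribeItems(string xml)
    {
        XDocument feed = XDocument.Parse(xml);
        // Elements with a prefix such as media: must be looked up with their namespace.
        XNamespace media = "http://search.yahoo.com/mrss/";

        var lines = new List<string>();
        foreach (XElement item in feed.Descendants("item"))
        {
            string title = (string)item.Element("title");
            string categories = string.Join(",", item.Elements("category").Select(c => c.Value));
            XElement thumbnail = item.Element(media + "thumbnail");
            string thumbnailUrl = thumbnail != null ? (string)thumbnail.Attribute("url") : null;
            lines.Add(string.Format("{0} | {1} | {2}", title, categories, thumbnailUrl));
        }
        return lines;
    }

    public static void Main()
    {
        foreach (string line in DescribeItems(SampleFeed))
        {
            Console.WriteLine(line); // First post | csharp,rss | http://example.com/thumb.png
        }
    }
}
```

Plain RSS elements such as title and category have no namespace, so they can be looked up by name alone, while media:thumbnail needs the Media RSS namespace prepended.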

I used a separate method to parse the description field. It solves two issues.

1. The description field can contain the full post content, which is too long. So I take a substring of the content, limited to the length my site requires (400 characters).

2. In some RSS feeds the description field contains HTML. But I need plain text, so I used a regex to remove the HTML tags from the field content.
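
The two steps above can be tried in isolation with a small sketch like the one below. It reuses the same regex as the parser; the DescriptionCleaner class, the ToPlainText method and the sample HTML are hypothetical names and data chosen for this example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class DescriptionCleaner
{
    // Removes HTML tags and a few entities, then keeps at most maxLength characters.
    public static string ToPlainText(string html, int maxLength)
    {
        if (string.IsNullOrEmpty(html))
        {
            return string.Empty;
        }

        string plain = Regex.Replace(html, @"<[^>]*(>|$)|&nbsp;|&zwnj;|&raquo;|&laquo;", string.Empty).Trim();
        return plain.Length > maxLength ? plain.Substring(0, maxLength) : plain;
    }

    public static void Main()
    {
        string html = "<p>Hello <b>RSS</b>&nbsp;world</p>";
        Console.WriteLine(ToPlainText(html, 400)); // Hello RSSworld
        Console.WriteLine(ToPlainText(html, 5));   // Hello
    }
}
```

Note that the regex deletes &nbsp; instead of replacing it with a space, so words separated only by that entity end up joined. If that matters for your content, one option is to replace the entity with a space before stripping the tags.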

Hope this helps.

Happy Coding
