Tuesday, January 26, 2016

RSS reader in C#

 

What is RSS?

RSS (Rich Site Summary) is a format for delivering regularly changing web content. Many news-related sites, weblogs and other online publishers syndicate their content as an RSS Feed to whoever wants it.

Recently I had a requirement to read the RSS feeds of Blogger and a few other sites and load the posts into a database.

I followed these steps.

1. Create Domain Classes

public class Post : Entity
{
    public Post()
    {
    }

    public Post(long id)
    {
        Id = id;
    }

    public DateTime DatePublished { get; set; }
    public Blogger Blogger { get; set; }
    public string PostGUID { get; set; }
    public string Title { get; set; }
    public string PostURL { get; set; }
    public string Description { get; set; }
    public string ThumbnailURL { get; set; }
    public string PostCategories { get; set; }
    public decimal RankGiven { get; set; }
    public int RecordCount { get; set; }
}
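Post derives from Entity and references Blogger, and neither type is shown in this post. A minimal sketch of what they might look like, assuming only the members that the code here actually touches (Id, BlogURL and RSSFeedURL). Both classes are guesses for illustration; the real ones probably carry more fields.

```csharp
// Hypothetical base class: only Id is implied by the Post(long id) constructor.
public class Entity
{
    public long Id { get; set; }
}

// Hypothetical Blogger class: only BlogURL and RSSFeedURL are used by the feed parser.
public class Blogger : Entity
{
    public string BlogURL { get; set; }
    public string RSSFeedURL { get; set; }
}
```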

2. Create Console App

The console app first reads all the blogger URLs from the database and then passes them to the RSS feed parser.

The following code reads each RSS feed and loads its content into the Post class defined above.
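The flow described here (load the bloggers, parse their feeds, save the posts) could be sketched roughly as below. The FeedImport class and its three delegate parameters are hypothetical; they stand in for whatever data access code you use, which this post does not show. The sketch is generic so that it compiles on its own.

```csharp
using System;
using System.Collections.Generic;

public static class FeedImport
{
    // Runs the three steps in order and returns the number of posts handed to savePosts.
    public static int Run<TBlogger, TPost>(
        Func<List<TBlogger>> loadBloggers,              // assumption: reads the bloggers from the db
        Func<List<TBlogger>, List<TPost>> parse,        // the feed parser
        Action<List<TPost>> savePosts)                  // assumption: writes the posts to the db
    {
        List<TBlogger> bloggers = loadBloggers();
        List<TPost> posts = parse(bloggers);
        savePosts(posts);
        return posts.Count;
    }
}
```

In the real console app the call would be something like FeedImport.Run<Blogger, Post>(LoadBloggers, RSSFeedParser.Parse, SavePosts), where LoadBloggers and SavePosts are your own database methods.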

    using System;
    using System.Collections.Generic;
    using System.Globalization;
    using System.Linq;
    using System.Text.RegularExpressions;
    using System.Xml.Linq;

    public class RSSFeedParser
    {
        // Maximum number of characters of the description to keep.
        private const int MaxDescriptionLength = 400;

        public static List<Post> Parse(List<Blogger> bloggerList)
        {
            List<Post> posts = new List<Post>();
            foreach (Blogger blogger in bloggerList)
            {
                if (!string.IsNullOrEmpty(blogger.RSSFeedURL))
                {
                    Console.WriteLine("Started reading feeds for : {0}", blogger.BlogURL);
                    try
                    {
                        var rssFeed = XDocument.Load(blogger.RSSFeedURL);
                        // Media RSS namespace, used for the optional <media:thumbnail> element.
                        XNamespace media = XNamespace.Get("http://search.yahoo.com/mrss/");
                        foreach (var item in rssFeed.Descendants("item"))
                        {
                            Post post = new Post();
                            post.Blogger = blogger;
                            // Casting an XElement to string yields null when the element is missing,
                            // so optional elements do not throw a NullReferenceException.
                            post.Title = (string)item.Element("title");
                            post.Description = GetTruncatedDescription((string)item.Element("description"));
                            // Parse with the invariant culture so the result does not depend on the machine's locale.
                            // A missing pubDate yields DateTime.MinValue.
                            post.DatePublished = Convert.ToDateTime((string)item.Element("pubDate"), CultureInfo.InvariantCulture);
                            post.PostGUID = (string)item.Element("guid");
                            post.PostURL = (string)item.Element("link");
                            var thumbnail = item.Element(media + "thumbnail");
                            post.ThumbnailURL = thumbnail != null ? (string)thumbnail.Attribute("url") : null;
                            post.PostCategories = string.Join(",", item.Elements("category").Select(x => x.Value));
                            posts.Add(post);
                        }

                        Console.WriteLine("Successfully parsed posts for  : {0}", blogger.BlogURL);
                    }
                    catch (Exception ex)
                    {
                        // Report the failure and continue with the next blogger.
                        Console.WriteLine("Error occurred while parsing posts for  : {0} ({1})", blogger.BlogURL, ex.Message);
                    }
                }
            }

            return posts;
        }

        // Strips HTML tags and a few common entities, then keeps at most MaxDescriptionLength characters.
        private static string GetTruncatedDescription(string description)
        {
            if (string.IsNullOrEmpty(description))
            {
                return description;
            }

            string plainTextDescription = Regex.Replace(description, @"<[^>]*(>|$)|&nbsp;|&zwnj;|&raquo;|&laquo;", string.Empty).Trim();
            return plainTextDescription.Length > MaxDescriptionLength
                ? plainTextDescription.Substring(0, MaxDescriptionLength)
                : plainTextDescription;
        }
    }

An RSS feed is an XML document with a defined set of nodes. The full list of nodes defined by the RSS 2.0 specification is available at the following site.

https://validator.w3.org/feed/docs/rss2.html
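For orientation, here is a small made-up feed and the LINQ to XML calls that read a few of its nodes. The feed content, the RssSample class and the FirstItemFields method are all invented for illustration; a real feed carries more channel and item elements, and the media:thumbnail element is only present in feeds that use the Media RSS extension.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class RssSample
{
    // Hypothetical feed used only for this example.
    public const string Xml = @"<rss version=""2.0"" xmlns:media=""http://search.yahoo.com/mrss/"">
  <channel>
    <title>Example Blog</title>
    <link>http://example.com/</link>
    <description>A sample feed</description>
    <item>
      <title>First post</title>
      <link>http://example.com/first-post</link>
      <guid>http://example.com/first-post</guid>
      <pubDate>Tue, 26 Jan 2016 10:00:00 GMT</pubDate>
      <category>csharp</category>
      <category>rss</category>
      <description>&lt;p&gt;Hello&lt;/p&gt;</description>
      <media:thumbnail url=""http://example.com/thumb.jpg"" />
    </item>
  </channel>
</rss>";

    // Returns the title, the comma-joined categories and the thumbnail URL of the first item.
    public static string[] FirstItemFields()
    {
        XDocument feed = XDocument.Parse(Xml);
        XNamespace media = "http://search.yahoo.com/mrss/";
        XElement item = feed.Descendants("item").First();

        string title = (string)item.Element("title");
        string categories = string.Join(",", item.Elements("category").Select(c => c.Value));
        // Elements from an extension need the namespace; plain RSS elements do not.
        string thumbnail = (string)item.Element(media + "thumbnail").Attribute("url");

        return new[] { title, categories, thumbnail };
    }
}
```

For this sample, FirstItemFields() returns "First post", "csharp,rss" and "http://example.com/thumb.jpg".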

Here I read the post title, URL, thumbnail, categories and the other data I need for my app.

I used a separate method to parse the description field, to solve two issues.

1. The description field can contain the full post content, which is too long. So I take a substring of the content, up to the length my site requires (400 characters here).

2. In some RSS feeds the description field contains HTML, but I need plain text. So I use a regex to remove the HTML tags from the field content.
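To see both steps on a concrete input, here is a standalone sketch that uses the same regular expression as the parser. The DescriptionDemo class, the Clean method and the sample HTML string are made up for this example, and the maximum length is a parameter so the truncation is easy to try with small values.

```csharp
using System;
using System.Text.RegularExpressions;

public static class DescriptionDemo
{
    // Removes HTML tags and the listed entities, trims whitespace,
    // then keeps at most maxLength characters.
    public static string Clean(string html, int maxLength)
    {
        string plain = Regex.Replace(html, @"<[^>]*(>|$)|&nbsp;|&zwnj;|&raquo;|&laquo;", string.Empty).Trim();
        return plain.Length > maxLength ? plain.Substring(0, maxLength) : plain;
    }
}
```

With the input "<p>Hello <b>RSS</b> world&nbsp;</p>", Clean(input, 400) returns "Hello RSS world" and Clean(input, 5) returns "Hello". Entities that are not listed in the pattern, such as &amp;, pass through unchanged.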

Hope this helps.

Happy Coding
