Tags from post content widget?

Topics: Controls
May 8, 2012 at 2:16 AM

Is there any consideration for a widget that strips out articles, prepositions, pronouns, etc, from Post text to create a list of unique/most common words that can then be used to populate the Tags field at creation time? I'm creating a site that queries Wikipedia, Flickr, Google, YouTube, etc. dynamically based on tag/keywords from the main post; an ability to auto generate a Top X or some other list would be a huge time saver.

 

Thanks

 

May 8, 2012 at 1:54 PM
Edited May 8, 2012 at 5:01 PM

That could be quite an interesting exercise: possibly time consuming, but also quite useful. First thoughts: you would need a lexicon (in your chosen language) containing all the "prepositions, pronouns, etc." as a lookup table of terms not to consider for tagging. Out of the remaining candidate words it would be fairly straightforward to select the most frequent terms. The tricky part might be defining the criteria for "unique" words and how to handle word variations, e.g. quick and quickly, mouse and mice, define and defining. Perhaps there could be some manual input to select the desired word variation.
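As a rough illustration of the lookup-table idea, here is a minimal Python sketch (Python is used only for brevity; nothing in this thread says how a real widget would be written). The names `STOP_WORDS`, `candidate_words` and `most_frequent` are hypothetical, and the stop list is deliberately tiny; a real lexicon would be much larger and language specific.

```python
import re
from collections import Counter

# Tiny illustrative stop list (an assumption, not a real lexicon).
STOP_WORDS = {
    "a", "an", "the", "and", "or", "but", "of", "to", "in", "on", "by",
    "for", "with", "is", "it", "that", "this", "was", "has", "been",
}


def candidate_words(text):
    """Lower-case the text, split it into words and drop stop words."""
    words = re.findall(r"[a-z0-9]+(?:'[a-z]+)?", text.lower())
    return [w for w in words if w not in STOP_WORDS]


def most_frequent(text, top_n=10):
    """Return the top_n (word, count) pairs among the non-stop words."""
    return Counter(candidate_words(text)).most_common(top_n)
```

Calling `most_frequent(post_text, top_n=10)` would give a "Top X" list of candidate tags; word variations are not handled at this stage.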

May 11, 2012 at 5:09 PM

Good points. It would be possible to create a quite elegant word matching algorithm using content analytics, but even just a simple parser with a basic lookup table would be useful for automated blog posting, for example tweeting a blog entry and having the new entry auto tagged.

May 11, 2012 at 7:15 PM

Oddly enough, I have been playing around with this today on a test rig.

I was curious to see just what kind of results you would get by removing prepositions, pronouns and conjunctions, stemming the words (to handle word variations) and picking out the most frequent.

Here's some test text:

Tablets Driving Mobile TV Viewing

A report from leading analyst firm Juniper Research has revealed that Mobile TV viewing is being driven by tablet computers. With more mediums for mobile TV viewing than ever before it has long been predicted that significant mobile TV viewership was an inevitability. What has been less certain is how consumers would choose to enjoy the ‘anytime, anywhere’ phenomenon. Juniper suggests that by 2014 mobile TV viewing will reach an average of 3 hours per month on tablets. Mobile content and applications specialist and report author Charlotte Miller believes it is only a matter of time before mobile TV is embedded in everyday social culture: “Consumers are already accustomed to timeshifting thanks to PVRs such as TiVo and Sky+; what mobile TV allows them to do is placeshift. This allows users to watch their pay-TV content anytime, anywhere and on any device – the TV experience is no longer confined to the home.” The appeal of tablet next to other portable devices is palpable in relation to mobile TV viewing. With modest yet engaging screen size tablets allow users to explore content (revealing features such as plot synopses and character biographies) giving a truly interactive experience. Juniper also stated that although streamed mobile TV viewing on smartphones was set to soar to 240 million by 2014, companies should be focussing on penetrating time-critical broadcasting as users are most likely to absorb breaking news on their smartphone.

Here are the results:

Top Ten Key words

mobile (10), tv (10), viewing (6), tablets (5), juniper (3), content (3), allows (3), users (3), report (2), consumers (2)

They are actually not too bad.
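For anyone wanting to try the stemming step, it can be sketched roughly as follows. This is not the code used on the test rig: `naive_stem` is a crude, hypothetical suffix stripper standing in for a proper stemmer, and it will not cope with irregular forms such as mouse and mice. Words (assumed to be lower-cased and already stripped of stop words) are grouped by stem, and each group is reported under its most frequent surface form.

```python
from collections import Counter, defaultdict

# Suffixes tried in order; an illustrative assumption, not a real stemmer.
SUFFIXES = ("ing", "ers", "er", "ed", "es", "s")


def naive_stem(word):
    """Strip one common suffix, keeping at least three letters of stem."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def top_keywords(words, top_n=10):
    """Count words per stem and label each stem with its commonest form."""
    forms = defaultdict(Counter)  # stem -> Counter of surface forms
    for word in words:
        forms[naive_stem(word)][word] += 1
    totals = Counter({stem: sum(c.values()) for stem, c in forms.items()})
    return [
        (forms[stem].most_common(1)[0][0], count)
        for stem, count in totals.most_common(top_n)
    ]
```

Reporting the commonest surface form rather than the bare stem keeps the list readable, since a stem on its own is often not a word anyone would use as a tag.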

Suggest tags

One line of thought was to do this in JavaScript, add a button to the post editor that when clicked would present you with this list as a pop up, from which you select the best words or combination of words - then hit the done button to add as tags. (Giving you the option to edit, but still quick).

Auto tag

For full automation, say as part of a "post saving" event, then there might be a way to do this without the content analysis (I fear my attempts at that might not be so elegant). Have a look at the post title:

Tablets Driving Mobile TV Viewing

If the keywords are in the title (and they should be) grab the consecutive matches as a single tag i.e.

Tablets (1 keyword), Mobile + TV + Viewing (3 keywords as one). That gives you two pretty damned good tags; the only other keyword worth having would be juniper (still need to think about a basis for accurate selection of that or any others).
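The title idea can be sketched in a few lines. This is only an illustration under the assumption that a keyword list is already available; `title_tags` is a hypothetical helper that walks the title and joins each run of consecutive keyword matches into a single tag.

```python
import re


def title_tags(title, keywords):
    """Join runs of consecutive title words that are keywords into tags."""
    keyset = {k.lower() for k in keywords}
    tags, run = [], []
    for word in re.findall(r"[A-Za-z0-9]+", title):
        if word.lower() in keyset:
            run.append(word)          # extend the current run of keywords
        elif run:
            tags.append(" ".join(run))  # a non-keyword ends the run
            run = []
    if run:
        tags.append(" ".join(run))
    return tags


keywords = ["mobile", "tv", "viewing", "tablets", "juniper",
            "content", "allows", "users", "report", "consumers"]
print(title_tags("Tablets Driving Mobile TV Viewing", keywords))
# prints ['Tablets', 'Mobile TV Viewing']
```

Because "Driving" is not in the keyword list it breaks the run, which is what separates "Tablets" from "Mobile TV Viewing".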

I'm leaning towards some kind of manual selection, but made quick and easy (click, click, click sort of thing).

May 11, 2012 at 9:32 PM

Absolutely, those are great examples. Even the automated Top 10 results seem like a valid and useful list of tags, which makes sense since key words would occur with a high frequency due to SEO considerations. The Suggest Tags button makes sense, and could even query the existing Tags table to increase word score, as well as search the Title as you mention. Having a simple 1, 2, 3 click Save process as you say seems quite practical.
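To make the scoring idea concrete, here is one possible sketch. The function name `score_candidates` and the boost values are assumptions made up for illustration; the thread does not settle on any weighting.

```python
def score_candidates(frequencies, existing_tags, title,
                     tag_boost=2.0, title_boost=1.5):
    """Rank candidate words, boosting those already used as tags or
    found in the post title. Boost factors are arbitrary placeholders."""
    existing = {t.lower() for t in existing_tags}
    title_words = set(title.lower().split())
    scored = {}
    for word, count in frequencies.items():
        score = float(count)
        if word in existing:
            score *= tag_boost
        if word in title_words:
            score *= title_boost
        scored[word] = score
    # Highest score first; ties broken alphabetically for stable output.
    return sorted(scored.items(), key=lambda item: (-item[1], item[0]))
```

With an existing tag "Juniper" in the Tags table, a word that appears only three times in the post would move up the list relative to equally frequent words that have never been used as tags.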

I've been considering an inline jQuery New Post feature from the home page, as well as tweeting new entries, and auto tagging would greatly assist the overall process; I'd imagine it would simplify data entry in any scenario.

Thanks much for the effort. 

May 11, 2012 at 11:19 PM

Did some more research on the topic and came across a very nice example on keyword extraction with code.

The site is tsJenson - In pursuit of .NET excellence, which incidentally is running on BlogEngine.

Downloaded the code, tested it and it works really well, still have the issue of what to select from the results, but nice.

I will study (emphasis on study) the analyser code for ideas, should have some time through the week to do something with it.

I'd like to have a crack at a fully automated and editable version, I'll give you a shout when I have something.

May 12, 2012 at 1:34 AM

Great post. Having worked with commercial content analytics products I'm aware of how prohibitively expensive those technologies can be, but just to have even a rudimentary capability in any blog engine seems truly valuable. Functionality like this would open up several additional uses, such as navigational and reporting capabilities based on dynamic tag clouds, for example generating Top X keywords based on posts that fall within a specified date range selection. For now obviously even just a simple keyword parser on the Add/Edit page would be an excellent feature. 
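As a sketch of the date range idea, assuming each post is available as a simple record with a publish date and a list of tags (the dictionary layout and the name `top_tags_in_range` are hypothetical, not an existing API):

```python
from collections import Counter
from datetime import date


def top_tags_in_range(posts, start, end, top_n=10):
    """Count tags over posts published between start and end inclusive."""
    counts = Counter()
    for post in posts:
        if start <= post["published"] <= end:
            counts.update(tag.lower() for tag in post["tags"])
    return counts.most_common(top_n)


posts = [
    {"published": date(2012, 5, 1), "tags": ["Mobile TV", "Tablets"]},
    {"published": date(2012, 5, 20), "tags": ["mobile tv", "Broadcast"]},
    {"published": date(2012, 6, 2), "tags": ["Tablets"]},
]
print(top_tags_in_range(posts, date(2012, 5, 1), date(2012, 5, 31), top_n=1))
# prints [('mobile tv', 2)]
```

The resulting counts could feed a dynamic tag cloud directly, with the count controlling the font size of each tag.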

May 23, 2012 at 1:12 PM
Edited May 23, 2012 at 1:50 PM

Have a first working version of auto suggest tagger.

Screen shots below will be available for 3 days.

The link to open as it might appear in the post editor

Opens in popup

Predictive tag suggestions as you type

The test post uses an excerpt from Wikipedia on the original Batman TV series, 5000 words long.

Suggest window pops up instantly.

Currently, the suggested words and phrases are text, but they could be presented as click to add links.

The disadvantage of click to add links is that the word presented might be something like "broadcasting" when the tag you want is "broadcast"; you would then have to edit the word after insertion (unless you present all word variations, which is possibly a bit busy).

I used a stripped down version of Tyler's code, but the results are pretty decent.

The other option would be to use the Alchemy .NET API, which is pretty sophisticated, although I haven't tried to implement this yet.

Thoughts?

PS

The predictive tag as you type works off existing tags, but the suggestions could be added to this list (temporarily in memory) so that they also appear as you type.
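A rough sketch of how that merged lookup might behave, written in Python purely for illustration. The function `predict` and the "existing"/"suggested" labels are assumptions; the labels are there so the UI could tell the two sources apart.

```python
def predict(prefix, existing_tags, suggestions=()):
    """Return (term, source) pairs whose term starts with prefix.

    Existing tags are listed first; a suggestion is skipped when the
    same term (ignoring case) is already present as an existing tag.
    """
    prefix = prefix.lower()
    seen = set()
    matches = []
    for source, terms in (("existing", existing_tags),
                          ("suggested", suggestions)):
        for term in terms:
            key = term.lower()
            if key.startswith(prefix) and key not in seen:
                seen.add(key)
                matches.append((term, source))
    return matches


print(predict("bro", ["Broadcasters", "Batman"],
              ["broadcasting", "broadcasters"]))
# prints [('Broadcasters', 'existing'), ('broadcasting', 'suggested')]
```

The in-memory suggestions are only passed in for the current post, so nothing is written to the Tags table until the post is saved.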

Another PS

Just been flipping through a number of past posts and running the suggester over them, in the majority of cases, the suggestions actually match the tags that I had already used - so I probably won't bother looking at the Alchemy API

May 24, 2012 at 4:09 PM

Andy,

Looks like you are working on another cool project.  Thanks for posting the link to tsJenson's blog.  Looks like interesting stuff.  I can't wait to see how this project takes shape.

-Ron

May 24, 2012 at 6:04 PM

Ron, 

Good to hear from you, I've sent you an email with log in details so that you can have a poke around.

A little tidying up to do, but any kind of feedback is always a boon.

Cheers

May 24, 2012 at 11:21 PM

Great stuff.

 

The suggest tags link on Add Entry sits in an unobtrusive and logical place for the feature, no questions there (aside from how your categories are nested: is that perhaps a feature in 2.6?).

Regarding the second capture, the inclusion of a Phrases option is a great feature. One question though: is each of the pipe-divided Words and Phrases intended to be a clickable "add to list" link? I couldn't tell from the screenshot, but that would be an excellent workflow option.

As for the Auto Suggest, it would be a great feature; my only concern would be performance. Suggesting off the existing Tags table shouldn't cause issues, but were you intending for real time suggest off the post content as well? If it can be disabled then no worries, just curious.

 

A larger question I have regards overall implementation. I understand an ASCX makes sense for widgetization, but were you intending for any public methods to be available for use outside of the Add Entry page? There are client side uses that could absolutely benefit from calling a "generate tags from post" method; it would be a chore to have to post back or jump through client side hoops to return the server side HTML.

 

And I agree with the previous poster: thanks a lot for the links to all the other content analysis stuff. I've also been playing with tsJenson's Keywords project, and the Alchemy stuff looks fascinating as well. Glad to know there are some open source products out there for content analysis, not just the prohibitively expensive tools I'm more familiar with. The combination of these technologies in the blogging/CMS world seems a very logical union, and could definitely help push this platform to the next level.

May 25, 2012 at 12:32 AM
Edited May 25, 2012 at 12:51 PM

Still on BE 2 here for the time being, I did a bit of work on the categories, they are set up to behave more like WordPress categories.

Yes, currently the items between the pipes are not links, but they are styled that way with a view to making them so. There is one drawback with that though: the word variations.

Take for example the word broadcast: it can appear as broadcasts, broadcasters and broadcasting. If you have been consistently using the term broadcasters when manually tagging (or if you only want to select part of a suggested phrase), then the quick click could screw that up a little. However, as regards workflow, the predictive tag as you type is very quick in bringing up options from existing tags, so if you start typing "bro", any words starting with that show, and your preferred term of "broadcasters" is click to enter from the drop down list. I'm glad you mention it and I've actually given that a lot of thought; current thinking is that if an exact match exists, then that appears as a click to add link, with the others remaining as text only.

The jury is still out on tag suggestions for the current post being part of the predictive tag typing, but I think it could be added with little impact on performance (instantaneous right now), again, pros and cons, it might confuse matters, how do you then know what's already a tag (mind you, the candidate terms could appear a different colour or something like that in the drop down to differentiate them).

As things stand with overall implementation, I've just compiled everything into the source for now, it was very convenient to do so, also been hooking into existing methods of the Add_entry.aspx.cs. The Add_entry already implements the ICallbackEventHandler, so things happen as the result of a client side call and as a partial post back anyway. The methods you would be interested in are public static, so I suppose you could do other stuff, hadn't given it much thought till you mentioned it.

Tyler's site, that was a good discovery, lots of goodies on that, got it bookmarked.

I think probably the best thing would be for you to try this out, the site is live so be careful not to save any posts and burn the log in details after reading - although I'll trash them in a couple of days anyway. I'm sending what you'll need now via your CodePlex contact details.

Great feedback, feel free to let us know how you get on, or any ideas.

Edit - Further thought -

Instead of presenting the suggestions as links, how about just highlighting any suggested word, phrase (or part of phrase) that matches with any variation of itself in existing tags?

Then as you start to type your chosen term in the tag entry box, the exact match presents itself.
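To illustrate the highlighting idea, here is a minimal sketch. It is not the code running on the site: `naive_stem` is a crude, hypothetical suffix stripper, and `classify` is a made-up name. Each suggestion is marked "exact" (could be a click to add link), "variation" (highlight it, then pick the preferred term from the drop down) or "new".

```python
# Suffixes tried in order; an illustrative assumption, not a real stemmer.
SUFFIXES = ("ing", "ers", "er", "s")


def naive_stem(word):
    """Strip one common suffix, keeping at least three letters of stem."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def classify(suggestions, existing_tags):
    """Label each suggestion by how it relates to the existing tags."""
    exact = {t.lower() for t in existing_tags}
    stems = {naive_stem(t) for t in exact}
    labels = {}
    for suggestion in suggestions:
        key = suggestion.lower()
        if key in exact:
            labels[suggestion] = "exact"
        elif naive_stem(key) in stems:
            labels[suggestion] = "variation"
        else:
            labels[suggestion] = "new"
    return labels


print(classify(["broadcasting", "broadcasters", "batman"], ["Broadcasters"]))
# prints {'broadcasting': 'variation', 'broadcasters': 'exact', 'batman': 'new'}
```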

May 25, 2012 at 4:24 PM
Edited May 25, 2012 at 4:32 PM

I believe suffix handling as you mention is going to be an issue in any scenario. I can't imagine the UI being able to accommodate every variation, for example a dropdown containing the various possible suffixes (ing, er, ive, etc.) for any word stem; that seems like something for the code base algorithms to try to handle before it can be addressed at the UI. The "add if exact match, with the rest being plain text" behavior seems sufficient to me, certainly for a version 1.0 at least.

Perhaps existing tags, and all their variations, should optionally never be displayed as suggested words since the tagging process is mainly concerned with new tags as opposed to variations? This would help increase availability of other candidate words since in a Top X lookup having both broadcasting and broadcasters exist as new suggestions, when broadcast already exists in the tags table, would reduce the number of potentially new words from the Top X count. 

I suppose this depends on use cases, meaning are the majority of people ever going to want broadcasters AND broadcasting AND broadcasts? It would seem quite cluttered to include multiple suffix variations for every word, and I don't know that automation could determine the best suffix fit (SEO wise) based on word stem, i.e. automatically assume a noun (er), verb (ing), or adjective (ive, able) ending.

The highlight matching of suggested tags with existing ones as you mention would help eliminate duplication and speed up overall workflow, and I don't see a drawback to that. Some manual control is going to be required at some point for refinement; I doubt anyone would let the system auto generate tags without review, even if it is technically possible.

I tested a few random posts on your site (nothing saved), and in only one case did I find the matches to be somewhat curious, but I believe that's just the nature of the content I was pasting in; technical, news, and other dry type of content generated great keywords, as a base, but a single somewhat humorous post gave some results which might seem funny at first, though this makes sense considering abstraction and irony would cause a bit of confusion with any semantic library, so no actual concerns there.

Nice to see the suggest feature is instantaneous, hard to imagine a scenario where a tags table lookup would be slow, I don't imagine a blogging/CMS system containing millions of tag rows. 

I was thinking perhaps in the second screenshot, a user might want to combine candidate words to form a new phrase, for example "Batman episodes" doesn't exist in the suggested phrases, but checkboxes next to each suggested word could allow a three click process of Batman + episodes => Add, meaning dynamic user phrase generation from suggested words, in addition to the automated suggestions. Again, this isn't required for this version, but maybe something to think about for upcoming functionality. 

I'm currently sandboxing an application that would make instant use of this feature, both on Add entry and on the client side, I'll try to get it tidied up a bit if you'd care to take a look, might help think up additional use cases, there are a couple that immediately come to mind.

May 25, 2012 at 6:20 PM
Edited May 25, 2012 at 8:17 PM

Thanks for all that, food for thought.

I probably should have mentioned that I'm using a related posts widget that relies largely on tag matches between posts, which is why I've been placing quite a bit of emphasis on comparison with existing tags and consistency of word variations.
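To show why consistency matters here, a tag matching lookup along these lines (a sketch only; `related_posts` and the data layout are hypothetical, not the widget's actual code) treats "broadcast" and "broadcasters" as unrelated strings:

```python
def related_posts(target_tags, posts, limit=3):
    """Rank other posts by the number of tags shared with target_tags.

    posts maps a post title to its list of tags. Matching is exact
    (ignoring case), so word variations do not count as a match.
    """
    target = {t.lower() for t in target_tags}
    ranked = []
    for title, tags in posts.items():
        shared = target & {t.lower() for t in tags}
        if shared:
            ranked.append((len(shared), title))
    ranked.sort(key=lambda pair: (-pair[0], pair[1]))
    return [title for _, title in ranked[:limit]]


posts = {
    "IPTV rollout": ["broadcast", "IPTV"],
    "BBC news": ["broadcasters", "BBC"],
    "Tablet TV": ["tablets", "mobile tv"],
}
print(related_posts(["broadcast", "mobile tv"], posts))
# prints ['IPTV rollout', 'Tablet TV']
```

"BBC news" is left out because its tag is "broadcasters", not "broadcast", which is exactly the kind of miss that inconsistent variations would cause.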

There's quite a bit of latitude with the word variations, I had been playing with a version that stores all the words as objects containing the stem and corresponding word variations, so you could compare on stems and pluck out what you want from that (although that's not quite how this version works).
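That stem plus variations structure might look something like the sketch below. It is an illustration only, not the experimental version mentioned above: `WordGroup` and `group_words` are hypothetical names, and the stemming function is passed in by the caller rather than assumed.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class WordGroup:
    """A stem together with counts of the surface forms seen for it."""
    stem: str
    variations: Counter = field(default_factory=Counter)

    def preferred(self, existing_tags=()):
        """Prefer a variation already used as a tag, else the commonest."""
        existing = {t.lower() for t in existing_tags}
        for form, _ in self.variations.most_common():
            if form in existing:
                return form
        return self.variations.most_common(1)[0][0]


def group_words(words, stem):
    """Group lower-case words by stem, using the caller's stem function."""
    groups = {}
    for word in words:
        key = stem(word)
        groups.setdefault(key, WordGroup(key)).variations[word] += 1
    return groups
```

Comparing on `stem` finds the group, and `preferred` then plucks out the variation that is already in use, which keeps tagging consistent across posts.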

It would be difficult to imagine completely automated tagging as you say, considering language semantics and that tagging can be quite subjective, with many people tagging by concept, but if you can get something that offers reasonable suggestions, that has to be a good thing. On running this over a sample of existing posts I was getting a pretty good match rate between suggestions and actual tags, and in some cases it indicated that previous tagging could have been better. So at the very least, it can act as a guide.

The words along with the phrases give a pretty good overview, they complement each other quite nicely, and what you suggest about combining makes sense: mixing and matching between words and phrases and taking out what you want would still keep things simple if implemented as click, click, add. I do like the idea of that, but for the sake of the related post tagging I would still like some form of check against existing tags.

It would be great to have a look at the applications you have in mind, I'm sure that would help round things out and give better perspective.

PS

On reflection, I don't think I explained the broadcast, broadcasters and broadcasting point terribly well, and it's probably worth clarifying.

I use broadcast as a blanket term for a technology type, broadcast vs streaming/IPTV.

Broadcasters as a blanket term for organisations such as the BBC, HBO etc.

So I'm thinking there are probably quite a few situations where a term variation can take on significantly different meaning.

Where that's the case, only existing variations would show in the drop down and if you want some other variation, then you just ignore the suggestion and type what you want.

May 25, 2012 at 10:34 PM

I see your point about semantic concepts as opposed to words, and broadcast is a good example. You're right that there are several similar words which can be a mixture of noun, adjective and verb, so tying the functionality to related posts and existing tags would make sense for most blogs, since most are probably themed, as yours is, and would want that type of consistency. My own site is a mishmash of many things, but I doubt that's the norm, so I think your approach will work just fine. Either way, I'd think every site would benefit from having similar posts tagged alike, as it would assist the standardization of both search and tag cloud functionality, so it makes sense to me. And as mentioned previously, there will be many ways of cleverly using highlighting, italics, etc. to note the difference between existing tags and variations.

I agree that fully trusting automated tagging wouldn't generally be very practical, but I'd only use it as the first run in an iterative tag review process, which in certain cases might generate more validity than others, and would allow for instant functional blog content from an email/tweet, I'll try to send a link this weekend to demonstrate how I'm currently using tags, but again, this auto tagging will probably mostly be useful for social enterprise/editing/research applications, or CMS type sites, not the majority of blogs out there. 

It's great stuff, especially considering how quickly you've gotten something going; I'm sure a version 1.0 will be eagerly accepted by the community at large.