Trail-Dad GIS

The Long Tail of This.cm

2/14/2015

I joined This.cm a few weeks back and have been thrilled with the quality and diversity of content that is shared by its community. If you haven't heard about 'This,' it's a social media platform where users share one (and only one) link per day, presumably the best thing they have run across in the past 24 hours.

The premise works. Very, very well.

Great things to read are constantly surfaced, and at this point in the young site's growth, all of them are really convenient to browse. The golden ticket thus far has been the pairing of depth and breadth, those elusive partners of successfully curating content. Having wickedly smart and interesting folks in the mix doesn't hurt.

This.cm provides a welcome antidote to the paralysis I have felt while standing in the stream of every social media newsfeed or news aggregator. Cutting through the noise has been easier than ever, and is something that I hope the site continues to conquer as it grows.

With a long weekend ahead and feeling a bit under the weather, I figured it would be a fun exercise to take a peek at what has been the functional foundation of all the great content I have seen so far. To do so, I created a This.cm dataset that includes submission data from 11/1/2014 to 2/13/2015. For each submission, the dataset includes the submitted URL, user name, high-level domain, number of thanks, and title. This extracted data was then cleaned up and loaded into a data model for quick analysis and future additions.
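As a rough illustration of the cleanup step, here is a minimal sketch of deriving the domain field from a submitted URL. The record layout and the `high_level_domain` helper are hypothetical names chosen for this example, and stripping a leading "www." is only an assumption about how domains might be normalized, not a description of the actual data model.

```python
from urllib.parse import urlparse


def high_level_domain(url: str) -> str:
    """Return the lowercased host of a URL with any leading 'www.' removed."""
    host = urlparse(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


# Hypothetical record mirroring the fields listed above.
submission = {
    "url": "https://www.example.com/2015/02/some-article",
    "user": "some_user",
    "thanks": 3,
    "title": "Some Article",
}
submission["domain"] = high_level_domain(submission["url"])
print(submission["domain"])
```

Grouping by a normalized host like this is what lets thousands of individual links roll up into a few thousand distinct domains.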

The easiest place to start is with the big numbers. During the sample's time frame, nearly 19,000 submissions have been posted on This.cm. Of these, almost 14,000 are unique and they originate from about 3,500 distinct domains. To see how these distinct submissions have been added to the site over time, here is a basic figure:

There is a definite trend to the rise of daily content being added to This.cm (yay!). Invites to the site are being sent out and articles are being written about it (here, here, here). I know I'm proselytizing within my own circles.
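The headline counts and the daily series of distinct submissions could be produced along these lines. This is only a sketch: the sample rows are made up, and counting a repeated URL once, on the first day it shows up, is an assumption about what "distinct" means here.

```python
from collections import Counter
from datetime import date

# Made-up sample rows: (submission date, URL, domain).
rows = [
    (date(2015, 2, 9), "https://example.com/a", "example.com"),
    (date(2015, 2, 9), "https://example.com/a", "example.com"),  # repeat share
    (date(2015, 2, 10), "https://example.org/b", "example.org"),
    (date(2015, 2, 10), "https://example.com/c", "example.com"),
]


def summarize(rows):
    """Return (total submissions, unique URLs, distinct domains)."""
    urls = {url for _, url, _ in rows}
    domains = {domain for _, _, domain in rows}
    return len(rows), len(urls), len(domains)


def first_seen_per_day(rows):
    """Count each distinct URL once, on the earliest day it was submitted."""
    first_seen = {}
    for day, url, _ in rows:
        if url not in first_seen or day < first_seen[url]:
            first_seen[url] = day
    return Counter(first_seen.values())


print(summarize(rows))
for day, n in sorted(first_seen_per_day(rows).items()):
    print(day.isoformat(), n)
```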

The valleys in the above figure are largely weekends. This is consistent with traffic on sites that are used mostly from the desktop, which is itself one of the critiques that has been lobbed at This.cm (its desktop experience is currently its strongest). To break out this relationship among days of the week, here is another figure that splits submissions and number of thanks submitted per weekday:
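A weekday split like that can be sketched in a few lines. The input rows (a date plus a thanks count) are invented for illustration, and `by_weekday` is a hypothetical helper name.

```python
from datetime import date

WEEKDAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]


def by_weekday(rows):
    """Total submissions and thanks per weekday from (date, thanks) rows."""
    totals = {name: {"submissions": 0, "thanks": 0} for name in WEEKDAYS}
    for day, thanks in rows:
        bucket = totals[WEEKDAYS[day.weekday()]]  # weekday(): Monday is 0
        bucket["submissions"] += 1
        bucket["thanks"] += thanks
    return totals


# Made-up sample: two Monday posts, one Friday post, one Saturday post.
sample = [
    (date(2015, 2, 9), 5),
    (date(2015, 2, 9), 1),
    (date(2015, 2, 13), 2),
    (date(2015, 2, 14), 0),
]
for name, bucket in by_weekday(sample).items():
    print(name, bucket["submissions"], bucket["thanks"])
```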

All of this is fine and dandy, but what really interested me was the sources behind the content being recommended as the best of the day. A little bit of normalization later to determine domain-level usage, and we have a quick Pareto diagram that shows both the extremely long-tail of distinct domains as well as the smaller percentage of domains that make up the bulk of content (saving Zipf's Law for another day...).
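For anyone wanting to build a similar Pareto view, here is a minimal sketch of the table behind one: domains ranked by submission count, each with its cumulative share of all submissions. The domain names are placeholders and `pareto_table` is a name made up for this example.

```python
from collections import Counter


def pareto_table(domains):
    """Rank domains by submission count and attach the cumulative share."""
    counts = Counter(domains)
    total = sum(counts.values())
    table, running = [], 0
    for rank, (domain, n) in enumerate(counts.most_common(), start=1):
        running += n
        table.append((rank, domain, n, running / total))
    return table


# Placeholder data: one domain value per submission.
sample = ["a.com"] * 5 + ["b.com"] * 3 + ["c.com", "d.com"]
for rank, domain, n, cumulative in pareto_table(sample):
    print(f"{rank:>2} {domain:<8} {n:>3} {cumulative:6.1%}")
```

Plotting the count column as bars and the cumulative column as a line gives the usual Pareto diagram.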

20% of This.cm submissions come from six sources. In essence, 1 out of 5 links you see will be from one of these big players:

Moving up to 40% of the submissions on the site, the pool of domains expands by 31:

We can already see the distribution tail lengthening, but going up to 60% of submissions really shows the beginning of source diversity (and phenomenal usage of typography in branding).

While I would love to visualize the next two buckets of domains in 20% cumulative usage, the long tail becomes overwhelming. One needs to add 637 domains to all of those above for 80% of submissions.

Here is the whammy of the long tail and quite possibly my favorite part of the This.cm experience: The final 20% is made up of 2,762 distinct domains. I love knowing that 4 out of 5 links I see on the site are likely from familiar domains, but that 5th link is likely from a shadowy corner of the internet with the bonus that it was personally recommended by a human being. Many of the most engaging and thoughtful pieces that I've been linked to from This.cm have come from the long tail, and I hope that the trend continues.
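The bucket counts above can be reproduced with something like the following sketch. It assumes a domain belongs to the 20% band in which its cumulative share lands once the domains are ranked from most to least submitted; that tie-breaking rule is an assumption, and the numbers are invented.

```python
def domains_per_band(counts, band_pct=20):
    """Count ranked domains per cumulative band (band_pct must divide 100)."""
    total = sum(counts)
    n_bands = 100 // band_pct
    bands = [0] * n_bands
    running = 0
    for n in sorted(counts, reverse=True):
        running += n
        # Integer ceiling of (running / total) / (band_pct / 100), minus 1,
        # which avoids floating-point trouble at the band boundaries.
        index = -(-(running * 100) // (total * band_pct)) - 1
        bands[min(index, n_bands - 1)] += 1
    return bands


# Invented submission counts per domain, summing to 100.
sample = [10] * 2 + [5] * 4 + [4] * 5 + [2] * 10 + [1] * 20
print(domains_per_band(sample))
```

With a real long tail the last entry dwarfs the others, which is the shape described above.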

The problem with starting a little analysis like this is that I want to just keep plugging away, especially considering how enamored I am with what the users of the site have to offer and what a petri dish the social network is during its infancy and growth phases. I'll throw out a few more quick glances at things and then it's time to go hiking.

I don't have the time or energy to bust out complementary cumulative distribution plots around word usage in article titles or descriptive comments, but here is a simple word cloud of the titles posted on This.cm thus far in the sample I chose:

And the same for commentary by users describing their recommendations:
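The input to a word cloud is just a table of word frequencies. A minimal sketch of building one follows; the tiny stopword list and the sample titles are placeholders, and whatever tool draws the cloud would take these counts as its weights.

```python
import re
from collections import Counter

# A deliberately tiny stopword list, for illustration only.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "for", "on"}


def word_frequencies(texts):
    """Count lowercase words across texts, skipping stopwords."""
    words = []
    for text in texts:
        tokens = re.findall(r"[a-z']+", text.lower())
        words.extend(w for w in tokens if w not in STOPWORDS)
    return Counter(words)


# Placeholder titles.
titles = [
    "The Long Tail of the Internet",
    "A Long Read on Maps",
    "Maps and the Long Tail",
]
for word, n in word_frequencies(titles).most_common(3):
    print(word, n)
```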

The last thing I started to look at was the relationship between "thanks" that users give to distinct recommendations and the source domains. Borrowing from the exciting world of materials management, I employed a quick variant of ABC analysis to show where 20% of titles account for 80% of thanks across the whole site, and how that plays out per domain (seen as "A," or the blue parts of the bars in the figure below). The figure just shows some of the most popular domains, but there is definitely something in the ratio of A to B that highlights viral recommendations.
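For readers unfamiliar with ABC analysis, here is a sketch of the textbook version applied to thanks. Titles are ranked by thanks, and a title is labeled A while the running share of all thanks is within 80%, B up to 95%, and C after that. Those cut-offs are the conventional ones and an assumption here, since the exact variant used for the figure isn't spelled out; the sample data is made up and titles are assumed to be unique.

```python
from collections import defaultdict


def abc_classify(items, a_cut=0.80, b_cut=0.95):
    """items: (title, domain, thanks) tuples. Returns {title: 'A'|'B'|'C'}."""
    total = sum(thanks for _, _, thanks in items)
    labels, running = {}, 0
    ranked = sorted(items, key=lambda item: item[2], reverse=True)
    for title, _, thanks in ranked:
        running += thanks
        share = running / total if total else 1.0
        labels[title] = "A" if share <= a_cut else "B" if share <= b_cut else "C"
    return labels


def abc_by_domain(items, labels):
    """Tally how many A, B, and C titles each domain has."""
    tally = defaultdict(lambda: {"A": 0, "B": 0, "C": 0})
    for title, domain, _ in items:
        tally[domain][labels[title]] += 1
    return dict(tally)


# Made-up sample data.
items = [
    ("t1", "x.com", 50), ("t2", "y.com", 25), ("t3", "x.com", 12),
    ("t4", "y.com", 6), ("t5", "x.com", 4), ("t6", "z.com", 3),
]
labels = abc_classify(items)
for domain, counts in sorted(abc_by_domain(items, labels).items()):
    print(domain, counts)
```

The per-domain tallies are what a stacked bar per domain would show, with the A portion being the titles that pull in most of the thanks.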

The fun part of an initial analysis is that every viewpoint leads to more questions, and I'm happy to have the data modeled in a way that lets this be explored more heavily down the road. One question I'd love to research is the impact of content providers joining This.cm. I noticed that when The New Republic came on board, submissions from their domain exploded upon their arrival.

Network analysis would be a fascinating way to see the impact of thought leaders as certain nodes. Tying the long tail of domains to metadata that summarizes their content (news, sociology, science, entertainment, etc.) and watching temporal trends would be a blast.

In any case, I love what This.cm provides thus far and it has been a pleasure participating in its offerings, whether that means great things to read or fun data to play with.

Playing with Mount Hood in 3D

11/20/2013

I hope to actually post a workflow of what I'm doing here at a later time. That, and I'll probably drop a few expletives with regard to projections. For right now, just a quick overview.

I have a major habit of diving into the most random GIS (Geographic Information System) projects, usually sometime around midnight. By then, the kids are in bed, I've spent some quality time with my wife, and the ol' noggin is starting to relax. While this does nothing favorable for early wake-up calls the next morning, it does provide a fun outlet to explore the lands that interest me the most.

I never considered that I'd be one to blog about these projects, but because so many of them border on the adventures I try to chronicle on this site, it does feel right.

My last big project came on the heels of reading about the wonderful work done on landscape visualization by Bjørn Sandvik over at his thematic mapping blog. I began by playing around with a few DEMs I had lying around on the hard drive, namely of the quadrangles around Bend and the area of Mount Hood.

-- Intermission for technical venting:

Can I just say how much I regretted deleting my Ubuntu installation a few months back? While totally doable, I had one helluva time getting Windows 7 to play nicely with the different requirements of GDAL, Mapnik, Python, Perl, IIS, etc. In any case, things are happy on this Microsoft box for the moment...

After playing around with the pretty amazing three.js functionality and the great tutorials laid out by Bjørn, I began cranking out some rough landscape visualizations of Bend, Oregon up through the Central Cascades. The scaling was horrible out of the gate, and was tempered a bit as seen in the framed image above. A little more adjusting, and the addition of a crappy land cover did the trick to at least bring the area to life.

While aesthetically there was much to be desired, the drainage systems of the area were well represented. Furthermore, you could actually note Jefferson down to North and Middle Sisters, as well as Tam MacArthur Rim. The bonus was seeing the little valley that the Metolius flows northward through.

If you want a closer look, check it out here or click the respective pictures. A word of warning: If you're not using a WebGL-enabled browser, you may be bored and not see any of the pretty stuff. Newer versions of Chrome work nicely, Firefox is hit-and-miss, Safari typically needs to have WebGL enabled manually, and IE just doesn't work. A quick little browser check can be found at http://get.webgl.org.

On a side note, I'm hosting this via my Google Drive account. Weebly, who hosts this blog, does not allow some of the JavaScript madness contained therein. I tried, and maybe there is a way, but none of my screwing around made it work. One day I'll host this on an adult server and get to play for real!

I used another DEM to do the same with the area around Mount Hood, up past the Columbia River, and east toward The Dalles. I was much happier with the terrain coloration, and some fun experimentation with Mapnik's symbolizer classes yielded interesting river effects and points of interest. (All of this was done by hopping around in QGIS, where I put together a few shapefiles that distinctly attributed the campgrounds and trailheads of Mount Hood National Forest. A little hydrology on top, and then the XML merging in Mapnik took shape.)

The problem with this visualization, just like the Bend one before it, was largely that the resolution sucked. These were DEMs that covered a large area, with resolutions that were less than optimal (all details found on individual pages under the citation).

I opted for FTP'ing directly to the Oregon Geospatial libraries, and found the best DEMs possible without spending any money. To give it a shot, I stuck with Mount Hood and, as can be seen in the image to the right, some detail finally started to show. Hell, I could discern the spot on the Muddy Fork where I broke my foot a few months back!

Playing around with some more shapefile creation and Mapnik magic, I was able to lay out a pretty awesome visualization of Mount Hood that is glacially accurate (?), complete with lakes, trailheads, campgrounds, AND trails (!!!). If you zoom in enough, you'll see the trails dotted all around that majestic mountain.

I still have a lot of work planned, and once I get a bit of time, here's what I'm really hoping to accomplish:

In lieu of terrain cover, drape a topographic map mosaic on top of the wire frame.
Experiment with D3 to see if I can get tooltips or labels to work.
Work more on the fog functionality; I got it to work after looking at a bunch of examples, but never felt happy with it. I think the limited area being shown right now is the definite problem. That said, fog generated by an exponential squared function is pretty damn cool and lifelike.
Document and automate much of the process.
Expand the coverage area to showcase the hikes with my son and daughter.
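On the fog item, here is a small sketch of what an exponential squared fog curve looks like numerically. As far as I understand it, three.js's FogExp2 blends in the fog color by a factor of 1 - exp(-(density * depth)^2); the density value and the depths below are arbitrary illustration numbers, not settings from the Mount Hood scene.

```python
import math


def fog_exp2_factor(depth, density):
    """Fraction of fog color blended in at a view depth (0 = none, 1 = full)."""
    return 1.0 - math.exp(-((density * depth) ** 2))


# Arbitrary density and depths, in scene units.
density = 0.01
for depth in (0, 50, 100, 200, 400):
    print(depth, round(fog_exp2_factor(depth, density), 3))
```

The curve stays near zero up close and then ramps quickly, which may be why a scene covering only a small area never gets deep enough for the fog to look right.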

As for showcasing the hikes with the kids, it's been a hoot watching them play with these visualizations. Every time I leave them up on the screen, they grab the mouse and fly around the area.

As for documenting things, I'll try to do that here. On the evening I sat down to do that, I froze the computer doing an obscene raster merge and lost the damn file being compiled with my complete workflow (that I hadn't saved).


Spencer Haley