Posts Tagged ‘graph’

e10.6 community detection for my twitter network

Sunday, April 4th, 2010

last night i applied my network decomposition algorithm to a graph of some of the people near me in twitter.

first i build a friend graph for 100 people ‘around’ me (taken from a crawl i did last year). by ‘friend’ i mean that if alice follows bob then bob also follows alice.

here the graph, some things to note though; it was an unfinished crawl (can a crawl of twitter EVER be finished) and was done october last year so is a bit out of date.

friends (more…)

e10.5 revisiting community detection

Tuesday, March 30th, 2010

i’ve decided to switch back to some previous work i did on community detection in (social) graphs

the last chunk of code i wrote which tried to deal with weighted directed graphs was terribly, terribly, broken but it seems that simplifying to undirected graphs is giving me much saner results. yay!

here’s an example of my work in progress generated from the new version of the code

consider the graph

p97

and it’s corresponding decomposition

p97.dendrogram

the results are reasonable; the initial breaking of clusters [1,2,3,4,5,6] and [7,8,9,10,11,12] is the most obvious but some of the others are not as intuitive

[1,2,5] and [7,8,10] remain as unbreakable cliques though it’s arbitrary that 11 was broken off from [7,8,10] instead of 10 (arbitrary but an artifact related to my shortest path calculation for the edge betweenness)

the idea of identifying the edge to remove using edge betweenness works well but it is often the case there are many edges with the same maximal betweeness and you have to choose only one. i think my current implementation of picking one is a bit naive and i’m not sure if i should move to a stochastic / monte carlo style approach or focus more on modularity maximisation

e10.4 communities in social graphs

Tuesday, October 6th, 2009

social graphs, like twitter or facebook, often follow the pattern of having clusters of highly connected components with an occasional edge joining these clusters.

these connecting edges define the boundaries of communities in the social network and can be identified by algorithms that measure betweenness.

the girvan-newman algorithm can be used to decompose a graph hierarchically based on successive removal of the edges with the highest betweenness.

the algorithm is basically

  1. calculate the betweenness of each edge (using an all shortest paths algorithm)
  2. remove the edge(s) with the highest betweenness
  3. check for connected components (using tarjan’s algorithm)
  4. repeat for graph or subgraphs if graph was split

e10.2 tgraph crawl order example

Monday, September 21st, 2009

let’s consider an example of the crawl order for tgraph…

we seed our frontier with ‘a’ and bootstrap cost of 0.

fetching the info for ‘a’ shows 2 outedges to ‘b’ and ‘c’, from our cost formula these all have cost 0 + 1 + Log10(2+1) = 1.6

our frontier becomes [ {b,1.6}, {c,1.6} ]

next is ‘b’ and see it has an outdegree of 3, these nodes, b1 -> b3, all have a cost of 1.6 + 1 + Log10(3+1) = 3.2

our frontier becomes [ {c,1.6}, {b1,3.2}, {b2,3.2}, {b3,3.2} ]

next is ‘c’ with an outdegree of 15. these 15 nodes, c1 -> c15, have cost 1.6 + 1 + Log10(16) = 3.8

our frontier is now [ {b1,3.2}, {b2,3.2}, {b3,3.2}, {c1,3.8} ... {c15,3.8} ]

we would then continue with the ‘b’ nodes before the ‘c’ ones

note that this cost system is more than just an ordering, it allows an element at depth n+1 to be checked an element at depth n, if the cost of the latter is high enough.

e10.1 crawling twitter

Saturday, September 19th, 2009

our first goal is to get some data and the twitter api makes getting the data trivial. i’m focused mainly on the friends stuff but because it only gives user ids i’ll also get the user info so i can put names to ids.

a depth first crawl makes no sense for this one experiment, we’re unlikely to get the entire graph and are more interested in following edges “close” to me. instead we’ll use a breadth first search.

since any call to twitter is expensive (in time that is, they rate limit their api calls) instead of a plain vanilla breadth first we’ll introduce a cost component to elements on the frontier so help decide what to grab next. this is especially important for a graph like twitter where the outdegree of a node is often in the hundreds. it turns the crawl into something that is not strictly depth first but it works out.

to explain the cost component consider the expected connectivity of nodes in the twitter friend graph. most nodes have an outdegree of the order 20-200. occasionally you see much larger (in the 1000’s) or much smaller (under 10).

we might, naively perhaps, say that having a large outdegree means the person is a bit less strict with her following criteria and that some of them are not really that important to her. if this is the case we should focus a little more on getting nodes with smaller outdegree.

the formula i’ve come up with is to not consider the depth but instead add 1 + Log10(1+the outdegree of the previous node). in this way we penalise large outdegress, but not by a huge amount. we always add 1 to counter the cases where there are no edges leaving a node.