GitHub Conversations: Characterizing Framing & Measuring Openness

Aaksha Meghawat, Steven Moore, Qinlan Shen, Steven Dang, Carolyn P. Rose

Prologue

Who decides what is right? Who decides what is a ‘best practice’? The way open source developers, teams and software organizations deal with these questions is crucial to their success and the community as a whole.

The research I document in this post was carried out in 2016 by me and a few of my amazing lab mates. With the rise of remote work and most work and design conversations (negotiations, decisions) moving to Slack, Zoom, and other online mechanisms, I have a newfound appreciation for the insights we gleaned back in 2016.

The general opinion seems to be that, beyond likes and retweets, qualities of a discussion such as openness, topic diversity, and framing style cannot be measured objectively in any useful way. We challenged this assumption in our work and came across some interesting insights into developer productivity. Using our measures of conversation quality, we were able to improve the F1 score for predicting acceptance/rejection of a pull request by 66%. For two other key takeaways, you can jump directly to the key takeaways section.

The ultimate vision of this work was to translate our conversational quality measures into tools and interventions that would help developers and all kinds of teams have more productive conversations. Unbeknownst to us, many smart people had the same idea. Our dream would be to make this a seamless, real-time conversation quality measure that helps people have better conversations. The closest thing to this out there is Bridgewater Associates’ Dot Collector (from 9:30 onwards in this video).

Introduction

Discourse in online work communities differs from that found in weblogs, online support groups, or news sites — the task-focused yet social nature of the community influences the nature of the conversation, as discussions are not solely focused on the ideas and issues at hand but also on the social dynamics of the group (Richards, 2006). This raises the question of what the dominant manifestations of this social influence on task discussions are. Here we propose an analysis of GitHub pull-request conversation threads, looking both at how Openness and Framing are leveraged and at how the two may interact to influence the success of a community pull-request contribution.

Looking at the data through the perspective of Openness, we wanted to see how accepting a project’s community was when discussing a pull request. We also looked at the data through a Framing lens to see how the topics discussed in a thread were indicative of its success. Combining these two lenses would hopefully provide insights into how a thread’s contributors made their ideas heard while also playing their part in the project’s community.

Research Questions

We define “success” of a conversation as acceptance/merge of a pull request. Our primary research questions, then, concern whether and how the following have an impact on this “success” metric:

  • Openness: How do we define & detect openness computationally? (RQ1)
  • Framing Mechanisms: How do we define & detect different ways of framing computationally? (RQ2)
  • Topic Diversity & Topic Chains (Transactivity): How do topic diversity and transactive chains relate to success? (RQ3)

Data Description

We analyzed pull request conversations from a subset of GitHub projects for Ruby. The data consists of 1862 projects that contain at least one pull request. From these projects, we picked out threads that have at least three different contributors.

We picked out threads for qualitative analysis to get started on RQ1 & RQ2, sampling from both highly active and less active communities. To do this, we stratified the communities according to their total number of pull requests and sampled two projects above and two below the median number of pull requests. The threads in each selected community were divided into three groups based on the number of comments per thread (low, medium, and high), and we sampled three pull request threads from each community.
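
To make the sampling procedure concrete, here is a minimal sketch of it in Python. The use of pandas and the column names (`num_pull_requests`, `num_comments`, `project`) are assumptions for illustration, not the scripts we actually used.

```python
import pandas as pd

def sample_threads_for_annotation(projects: pd.DataFrame,
                                  threads: pd.DataFrame,
                                  seed: int = 0) -> pd.DataFrame:
    """Sample 3 threads each from two high-activity and two low-activity projects."""
    median_prs = projects["num_pull_requests"].median()

    # Stratify by activity: two projects above and two below the median.
    above = projects[projects["num_pull_requests"] > median_prs].sample(2, random_state=seed)
    below = projects[projects["num_pull_requests"] <= median_prs].sample(2, random_state=seed)
    selected = pd.concat([above, below])

    sampled = []
    for project in selected["project"]:
        project_threads = threads[threads["project"] == project].copy()
        # Split the project's threads into low/medium/high comment-count groups
        # and draw one thread from each group.
        project_threads["group"] = pd.qcut(project_threads["num_comments"],
                                           q=3, labels=["low", "medium", "high"])
        for _, group in project_threads.groupby("group", observed=True):
            sampled.append(group.sample(1, random_state=seed))
    return pd.concat(sampled)
```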

Measuring Openness (RQ1)

In our work, Openness/Expansiveness characterizes statements that create more dialogic space for another participant’s expression or opinion. This is often operationalized through a question or an explicit request for participation from other people, but it may also be demonstrated by expressing alternative views. We term a statement that has the opposite effect Contractive. (Note that the usual positive connotation attached to ‘Openness’ should not be carried over here, and vice versa for ‘Contractive’ statements: too many expansive statements can also lead to confusion or slow decision making, resulting in pull requests failing to be merged, as we show in later examples.)

Following the qualitative analysis, a list of expansive and contractive patterns was created from the common patterns found. These initial patterns were fed into the model, and we then checked whether they appropriately tagged units as expansive or contractive. After a small sample was coded, both qualitatively and quantitatively, for expansive/contractive patterns with potentially interesting markers, we proceeded with affinity diagramming. During the affinity diagramming process, markers that did not fit directly into expansive or contractive, such as a particular acronym or ‘+1’, were grouped into themes. After fitting them into their final groups, we counted the contexts of the units in which these markers were used. This gave us the proportion of expansive, contractive, and neutral units each marker appeared in. Markers that appeared in expansive or contractive units 75% or more of the time were used as patterns for the respective class. For example, the acronym ‘imo’ was used as a contractive pattern and ‘?’ was used as an expansive pattern.
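
To illustrate, the snippet below sketches how such a pattern-based openness tagger can work. The pattern lists shown are small illustrative stand-ins for the marker dictionaries distilled from the affinity diagramming step, and the hit-counting tie-break rule is an assumption made for this sketch.

```python
import re

# Illustrative pattern lists; the study's lists came from the affinity diagramming step.
EXPANSIVE_PATTERNS = [r"\?", r"\bwhat do you think\b", r"\bcould we\b", r"\bmaybe\b"]
CONTRACTIVE_PATTERNS = [r"\bimo\b", r"\bobviously\b", r"\bnever\b", r"\bmust\b"]

def tag_openness(unit: str) -> str:
    """Tag a conversational unit as expansive, contractive, or neutral."""
    text = unit.lower()
    expansive_hits = sum(bool(re.search(p, text)) for p in EXPANSIVE_PATTERNS)
    contractive_hits = sum(bool(re.search(p, text)) for p in CONTRACTIVE_PATTERNS)
    if expansive_hits > contractive_hits:
        return "expansive"
    if contractive_hits > expansive_hits:
        return "contractive"
    return "neutral"

print(tag_openness("Could we handle this in a separate PR? What do you think?"))  # expansive
print(tag_openness("imo this must be fixed before merging."))                     # contractive
```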

Openness | % Sample | % Computational Tagging
Expansive | 20.2 | 16.36
Contractive | 3.78 | 2.67

Table 1: Percentage of units tagged as expansive or contractive (sample vs. computationally tagged dataset)

Characterizing Framing (RQ2)

We distilled 5 different framing types to be key for developer conversations on Github from our qualitative analysis:

Code | Definition | Example
Establishing Statement | Introducing or supporting the factual/valid nature of a proposition | It’s very difficult to reproduce error because it depends on terminal width.
Alignment Statement | Stating agreement/disagreement on a specific discussion point, including statements of appreciation | Good idea @jferris. “+1”
Contrasting Statement | Discussing demonstrated or proposed attributes in contrast to alternatives | I would phase it out slowly, i.e. have has_plugin? recognize both current name and the gem name initially, then slowly deprecate it (with a warning possibly if the old name way is used).
Alignment Request | Requesting statement of stances on a specific discussion point | Could you please have a look there and comment?
Clarification Request | Requesting restatement/elaboration/clarification of a specific discussion point | If we leave this in: Is this something that needs to be generated during clearance install, if the users table exists?
Table 2: Framing Code Dictionary

Qualitative Analysis Methodology

Qualitative analysis began with a review of 3 conversations, annotating conversation fragments for two primary characteristics: framing-related discourse moves and the linguistic indicators of those moves. From this annotation library, an initial coding dictionary was formed by synthesizing the annotations into groups and forming operational definitions of each group. These codes were carried forward into the analysis of an additional 3 conversations in order to evaluate their robustness (ability to differentiate between statements) and coverage (ability to categorize all statements in the data). New framing moves were annotated and new synthesized definitions were formed from the revised groups. This analysis was repeated for the remaining 2 groups of 3 conversations until the final codebook, shown in Table 2, was formed.

Definitions of Framing Codes

The resulting codebook comprises two broad classes of discourse moves related to Framing: Statements and Requests. Statements are sentences that introduce information into the discourse, while Requests solicit information from others, where the information can be factual or opinion-based in nature.

Establishing statements form the foundation of most discourse: they cover all discourse moves that introduce information with the purpose of establishing its factual nature. From a framing perspective, these statements are important for determining which statements introduce factual information, and for identifying which of those statements are debated and which are ignored and/or accepted.

Alignment statements are specifically discourse moves which introduce an individual’s agreement or disagreement with a particular fact or discussion point. These statements capture categories of discourse moves intended to introduce individual opinion into the conversation.

Contrasting statements are defined as discourse moves that introduce information intended for comparison or contrasting with previously stated or assumed information. These statements specifically contain multiple perspectives and relational information between each perspective.

Alignment Requests are defined as statements explicitly soliciting the opinions of others. Like statements, requests can also introduce information, but the primary goal of the discourse move is to introduce information with the intent of soliciting others’ opinions about it.

Clarification Requests are defined as statements explicitly soliciting additional information from others usually with respect to previously introduced or assumed information. These requests may also introduce information, but with the intent of supporting the specifics of the request for additional information.

Key Observations: Usually a contribution is elaborated early in the conversation, and one or two specific points are discussed for the remainder of the thread. In the sample analyzed, two threads from the same community introduced comparable features (lazy loading), but each thread presented the value of the idea from a different perspective; as a result, one thread was merged and the other was rejected. On the other hand, many of the rejected threads followed a pattern of short elaborated descriptions followed immediately by at least one contrasting statement, which is then followed by alignment statements. This indicates that capturing patterns of these base categories of statements could be valuable in understanding the conversational dynamics.

Similar to Openness, we extracted patterns for our Framing categories and tagged each sentence in each post as belonging to one of the five framing classes. Using the sentence tokenizer in NLTK, we split each post into individual sentences. For each sentence, we counted how many patterns from each framing category matched the sentence and took the class with the most pattern hits as the framing class of the sentence. Sentences that did not match any pattern were assigned to the Establishing Statement class. For each post, we then computed the percentage of sentences belonging to each category.
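
The sketch below shows one way this sentence-level framing tagger can be implemented with NLTK. The regular-expression pattern lists are illustrative placeholders rather than the dictionaries derived from our qualitative analysis.

```python
import re
from collections import Counter
import nltk  # may require a one-time nltk.download("punkt")

# Illustrative pattern lists; the real dictionaries came from qualitative analysis.
FRAMING_PATTERNS = {
    "alignment_statement": [r"\+1", r"\bgood idea\b", r"\bagree\b", r"\bdisagree\b"],
    "contrasting_statement": [r"\binstead\b", r"\brather than\b", r"\bon the other hand\b"],
    "alignment_request": [r"\bcould you .*comment\b", r"\bwhat do you think\b", r"\bthoughts\?"],
    "clarification_request": [r"\bwhy\b.*\?", r"\bhow\b.*\?", r"\bcould you (explain|clarify)\b"],
}

def tag_post_framing(post: str) -> dict:
    """Return the fraction of sentences in a post assigned to each framing class."""
    sentences = nltk.sent_tokenize(post)
    counts = Counter()
    for sentence in sentences:
        text = sentence.lower()
        hits = {label: sum(bool(re.search(p, text)) for p in patterns)
                for label, patterns in FRAMING_PATTERNS.items()}
        best_label, best_hits = max(hits.items(), key=lambda kv: kv[1])
        # Sentences matching no pattern default to the Establishing Statement class.
        counts[best_label if best_hits > 0 else "establishing_statement"] += 1
    total = len(sentences) or 1
    return {label: counts[label] / total
            for label in list(FRAMING_PATTERNS) + ["establishing_statement"]}
```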

Transactive Chains (RQ3)

A major aspect of framing is the control of what topics are discussed in a thread. To operationalize this aspect of framing for our model, we introduced the concept of transactivity chains: time-ordered chains of comments in a thread that are similar in what they discuss. The rationale behind these transactivity chains is that they represent different “flows” in the thread about what is being discussed — one transactivity chain represents the entire lifespan of a certain conversation subject. By examining the different transactivity chains across a thread, we can gain insight into what kinds of conversational structures may be indicative of successful collaboration.

We define a transactivity chain as a time-ordered chain of comments in which connected posts have some semantic similarity. To measure the semantic similarity between posts, we ran LDA (Blei et al., 2003) over our entire dataset, using 50 topics and treating individual comments as documents. This produces a topic model over our dataset and assigns each post a distribution over the generated topics. We then consider two posts to be semantically similar if the cosine distance between their topic distributions is less than 0.5.
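
The sketch below shows the gist of this machinery with gensim and SciPy. In the study, LDA was fit over the full dataset rather than per thread, and the exact rule for linking a comment into a chain is not spelled out above; the greedy rule used here (attach a comment to the most recently active chain whose last post is within the distance threshold, otherwise start a new chain) is an assumption for illustration.

```python
from gensim import corpora, models
from scipy.spatial.distance import cosine

NUM_TOPICS = 50
DISTANCE_THRESHOLD = 0.5

def topic_vectors(tokenized_comments):
    """Fit LDA over tokenized comments and return one dense topic vector per comment."""
    dictionary = corpora.Dictionary(tokenized_comments)
    bows = [dictionary.doc2bow(tokens) for tokens in tokenized_comments]
    lda = models.LdaModel(corpus=bows, id2word=dictionary, num_topics=NUM_TOPICS)
    vectors = []
    for bow in bows:
        dense = [0.0] * NUM_TOPICS
        for topic_id, prob in lda.get_document_topics(bow, minimum_probability=0.0):
            dense[topic_id] = prob
        vectors.append(dense)
    return vectors

def build_transactivity_chains(vectors):
    """Greedily group time-ordered comments into chains of semantically similar posts."""
    chains = []  # each chain is a list of comment indices, in time order
    for i, vec in enumerate(vectors):
        for chain in reversed(chains):  # prefer the most recently started chain
            if cosine(vec, vectors[chain[-1]]) < DISTANCE_THRESHOLD:
                chain.append(i)
                break
        else:
            chains.append([i])  # no similar chain found: start a new one
    return chains
```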

We provide a simple example to demonstrate how we represent the transactivity chains as features for our model. Consider Figure 1, where each numbered node represents a comment (time-ordered) in a thread and edges represent that two nodes are part of the same transactive chain. The thread represented by the figure consists of 6 posts that make up two transactivity chains.

Figure 1: Example transactive chain structure

We encode each transactivity chain as 3 features: the number of comments in the chain (#Comments), the number of comments the chain spans (ΔT), and the normalized position of the chain’s final comment relative to the end of the thread (LastTimeSlot). This gives us the following set of features for each of the transactivity chains in Figure 1:

ID | #Comments | ΔT | LastTimeSlot
1 | 3 | 6 | 6/6
2 | 3 | 3 | 5/6
Feature extraction for the transactive chains shown in Figure 1

To summarize these features at the thread level, we use the median, mean, min, and max of the transactivity chain features across the chains of a thread. We also use the median, mean, min, and max, across the chains of a thread, of the percentage and total number of statements tagged with each openness and framing category.
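
Here is a minimal sketch of the per-chain features and their thread-level summary, assuming chains are represented as lists of 0-indexed comment positions; the example chains at the bottom are hypothetical ones chosen to reproduce the feature values in the table above.

```python
import numpy as np

def chain_features(chain, n_comments):
    """Per-chain features: #Comments, ΔT (comments spanned), and LastTimeSlot."""
    return {
        "num_comments": len(chain),
        "span": chain[-1] - chain[0] + 1,                 # inclusive span in comments
        "last_time_slot": (chain[-1] + 1) / n_comments,   # normalized position of last comment
    }

def thread_features(chains, n_comments):
    """Summarize the per-chain features at the thread level (median, mean, min, max)."""
    per_chain = [chain_features(c, n_comments) for c in chains]
    summary = {}
    for name in ("num_comments", "span", "last_time_slot"):
        values = np.array([f[name] for f in per_chain])
        summary[f"{name}_median"] = float(np.median(values))
        summary[f"{name}_mean"] = float(values.mean())
        summary[f"{name}_min"] = float(values.min())
        summary[f"{name}_max"] = float(values.max())
    return summary

# Hypothetical chains for a six-comment thread, chosen to reproduce the
# feature values in the table above (ΔT of 6 and 3, LastTimeSlot of 6/6 and 5/6).
print(thread_features([[0, 3, 5], [2, 3, 4]], n_comments=6))
```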

Results & Key Takeaways

The surprising relationship between Openness & Transactivity

Graph 1: No. of Transactive Chains vs. Expansive Units

In Graph 1 we see that points with more than ~50 expansive units are too sparse to demonstrate a trend. We intuitively expected that the more expansive conversation units a conversation contained, the more topics (and hence transactive chains) would be found in the thread. However, as seen in Graph 2, the general tendency to the left of the red line is that an increase in expansiveness does not necessarily increase the number of transactive chains. On looking at some of the conversations, our intuition was that increasing expansiveness created greater dialogic space that encouraged discussion of the current topic, in fact encouraging the continuity of a transactive chain and potentially greater consensus building.

Graph 2: Average number of topic chains in a conversation vs. number of Expansive comments.

This intuition is also reflected in Graph 3, where the length of the longest transactive chain increases with an increasing presence of expansiveness.

Graph 3: Length of longest transactive chain vs. Expansiveness

Conversely, we expected threads higher in contrastiveness to have a lower count of transactive chains, as such posts would tend to be monoglossic dead ends. However, we found that in the threads shown in Table 3, the threads’ commenters reference other users in the project who have not yet commented on the thread. This brings those users into the thread’s conversation, where they then contribute to the discussion and generate new transactive chains. It also appears to add to the contrastiveness of the conversation, as these users often come into the conversation and immediately offer their opinion.

Example Thread | # Transactive Chains | Contrastive %age
https://github.com/jekyll/jekyll/pull/1682 | 5 | 25
https://github.com/mitchellh/vagrant/pull/1592 | 5 | 21.43
https://github.com/resque/resque/pull/772 | 6 | 21.05
https://github.com/vcr/vcr/pull/404 | 5 | 21.48
Table 3: Example threads with their number of transactive chains and percentage of contrastive units

Uncovering a best practice to increase developer productivity

An interesting finding we came across was the interaction between one of our framing categories, clarification requests, and our measure of openness. We found that in threads that had a high percentage of clarification requests and an equally high percentage of expansive units, developers followed a common pattern to make their voices heard. The developers always began by referencing a particular portion of the commit with a code block and then proceeded to ask a question about it. By following such a pattern, the commenters were able to have their question(s) and opinion addressed immediately.

This insight can be translated directly into a feature design for GitHub. For example, every time a developer has a question, they could be prompted to reference the particular block of code relevant to that question, encouraging a more useful and productive discussion.

Prediction of Pull Request Acceptance (Primary Research Question)

The final experiment we ran was to use our features to train a classifier to predict whether a pull request would be accepted or rejected. We trained a logistic regression classifier and used 5-fold cross-validation on our fully tagged dataset (which includes our openness tags, framing category tags, and transactivity chain features) to compare the performance of four different feature configurations:

  • Openness: % and total number of each openness category
  • Framing: % and total number of each framing category
  • Openness + Framing: combination of Openness and Framing features
  • Openness + Framing + TC: Openness, Framing, and Transactivity Chain Features (described in the Transactivity Chains section)

We compare these models to a baseline model, which predicts the majority class (Accepted).
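
A minimal sketch of this experiment with scikit-learn is shown below. The feature matrices (`X_openness`, `X_framing`, `X_tc`) and labels (`y`) are placeholders for the extracted feature tables, and macro-averaged scoring is an assumption about how the metrics in Table 4 were computed.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def evaluate(name, model, X, y):
    """Report mean 5-fold cross-validated macro-F1 for one feature configuration."""
    scores = cross_val_score(model, X, y, cv=5, scoring="f1_macro")
    print(f"{name}: macro-F1 = {scores.mean():.3f}")

# Usage with hypothetical feature matrices and merged/rejected labels:
# evaluate("Baseline", DummyClassifier(strategy="most_frequent"), X_openness, y)
# evaluate("Openness", LogisticRegression(max_iter=1000), X_openness, y)
# evaluate("Framing", LogisticRegression(max_iter=1000), X_framing, y)
# evaluate("Openness + Framing", LogisticRegression(max_iter=1000),
#          np.hstack([X_openness, X_framing]), y)
# evaluate("Openness + Framing + TC", LogisticRegression(max_iter=1000),
#          np.hstack([X_openness, X_framing, X_tc]), y)
```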

Model | Accuracy | Precision | Recall | F1
Baseline | 53.8 | 26.9 | 50.00 | 34.98
Openness | 53.87 | 52.45 | 51.50 | 47.41
Framing | 59.24 | 59.25 | 57.72 | 56.67
Openness + Framing | 59.31 | 59.14 | 57.97 | 57.24
Openness + Framing + TC | 59.55 | 59.24 | 58.44 | 58.04
Table 4: Classification results for predicting the acceptance of a pull request

Using our features measuring openness, framing styles, and topic diversity/transactive chains, we improve the F1 score over the baseline by 66%. This indicates that the features we developed encapsulate meaningful quality measures of developer discussions and can be used to gauge the potential productivity of a group of open source developers.

Data Driven Case Study

As an example, we examined the outlier thread circled in red in Graph 1. We were able to neatly summarize the negotiation that happened in that conversation using our metrics.

Figure 4: The longest conversation in our data whose general flow was neatly summarized by our features.

Works Cited

  • Argamon, S., Whitelaw, C., Chase, P., Hota, S. R., Garg, N., & Levitan, S. (2007). Stylistic text classification using functional lexical features. Journal of the American Society for Information Science and Technology, 58(6), 802-822.
  • Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3, 993-1022.
  • Chatterjee, M. (2007). Textual engagements of a different kind? In Proceedings of the Australian Systemic Functional Linguistics Association Congress.
  • Chong, D., & Druckman, J. N. (2007). Framing theory. Annual Review of Political Science, 10, 103-126.
  • Cosentino, V., Cánovas Izquierdo, J. L., & Cabot, J. (2014). Three metrics to explore the openness of GitHub projects. arXiv preprint arXiv:1409.4253.
  • Gee, J. (2011). An Introduction to Discourse Analysis: Theory and Method. Routledge.
  • Howley, I., Mayfield, E., & Rosé, C. P. (2012). Linguistic analysis methods for studying small groups. In The International Handbook of Collaborative Learning.
  • Jo, Y., & Oh, A. H. (2011). Aspect and sentiment unification model for online review analysis. In Proceedings of the Fourth ACM International Conference on Web Search and Data Mining (pp. 815-824). ACM.
  • Martin, J., & White, P. (2005). The Language of Evaluation: Appraisal in English. Palgrave Macmillan.
  • Nguyen, V. A., Boyd-Graber, J., Resnik, P., Cai, D. A., Midberry, J. E., & Wang, Y. (2014). Modeling topic control to detect influence in conversations using nonparametric topic models. Machine Learning, 95(3), 381-421.
  • Read, J., & Carroll, J. (2012). Annotating expressions of appraisal in English. Language Resources and Evaluation, 46(3), 421-447.
  • Richards, K. (2006). Language and Professional Identity: Aspects of Collaborative Interaction. Palgrave Macmillan. (Chapter 3)
  • Taboada, M., & Grieve, J. (2004). Analyzing appraisal automatically. In Proceedings of the AAAI Spring Symposium on Exploring Attitude and Affect in Text (pp. 158-161). AAAI Press.
  • Tsay, J., Dabbish, L., & Herbsleb, J. (2014). Let’s talk about it: Evaluating contributions through discussion in GitHub. In Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering (pp. 144-154). ACM.

This research was supported by an NSF “BIGDATA” grant. Code, data, and features can be provided on request.

3 years of Swimming. In Data.

As 2016 is *finally* coming to an end, a year whose repercussions will be felt for many more to come, I cannot help but reflect on the past few years that I have spent swimming. Swimming in data, that is.

I have had the opportunity to be a productive participant in the field of data science from many perspectives. Having spent time both in academia and industry, and as both a producer and a consumer of data science solutions, I have plenty of lessons that I remind myself of. Here are the key takeaways from all those years of swimming in data and finding things that will hopefully be useful for someone:

It takes a village.

‘Big Data’ was a big hype word until very recently, when everyone realized that the ‘game-changing’, ‘life-changing’ and ‘world-changing’ data either did not exist or was in the hands of very few people or companies.

There is an unhealthy disregard in academia and industry for collecting data. In the way research progresses in AI, ML, and all the related fields that collectively affect our ability to make the most of our data, there is a race to publish the most sophisticated mathematical solution.

In an article on ‘Why Deep Learning is changing your Life?‘ I was happy to note that there was a small paragraph dedicated to Fei-Fei Li, a Stanford AI professor who started the first serious, concentrated effort to collect data for computer vision. It was only after she created ImageNet that the computer vision researchers of the world could show off their sophisticated math skills at making sense of complex data. It was this effort that created the arena where thousands of researchers with Olympic levels of math skill could finally use their expertise meaningfully.

Which brings me to my first point: creating and having good datasets is essential, crucial, and possibly more useful than all those complicated math models that some people spend hours first building and then making sense of. A good dataset with a simple model can sometimes tell you far more than the most complicated function approximator run on a bad dataset.

In short, the prestige incentives of academia are not always efficiently aligned with the real world need for making (game/life/world)-changing data science advances.

Now, on to why I titled this section ‘It takes a village’.

My hunch is that the ‘Big Data’ hype was generated because, the way the world operates, what matters to the big guys is usually projected as what should matter to everyone else. For all the hype and McKinsey reports on the big data revolution, the big data companies that actually exist, in my opinion, are exactly these: Google, Facebook and Amazon. (Apple & Microsoft notably have their foundations as product companies.)

Analogous to how the problems of the developing world are never even discussed, let alone fixed, the real data problems of the rest of us don’t garner enough attention. Most research problems in the field today are trying to solve problems for scenarios that largely exist only for these companies (and the NSA, shhh, secret).

Most questions in industry that people now want to answer in a ‘smarter’ way, because they know that a (smart)phone exists in every hand, don’t have the right datasets behind them. And definitely not in the context of the developing world. (There is a connection I see between this gap and the recent wave of the Indian startup scene, but more on that in another post.)

There is another hurdle to the creation of such datasets: data collection takes time, and it also takes time and resources before it starts generating dividends. In industry, with the Chinese whispers that happen between the sales team, the deal makers, and everybody else, and the pressure to quickly book profits and revenues, solutions are sold before there is time to figure out whether the questions the client is asking can even be answered with the data available.

It takes guts to ask the question ‘Do we even have the data?’ and consider the possibility of hearing back a ‘No’. Then it takes deep pockets to fix that issue. And then it takes perseverance to come up with a meaningful solution. Which is why I am not playing a blame game here. Businesses and companies need to be run, and what needs to be sold to that end needs to be sold. But in the end, only long-term thinking can produce anything valuable.

So, in the industry too, the incentives are not always efficiently aligned with the real world need for making (game/life/world)-changing data science advances.

So, as a data science PM/engineer/researcher, what does all this mean for you? As much as possible, fix that misalignment. *It takes a village* to build a useful data science solution, right from the engineer who builds the system for data collection (if you have the luxury of having such a person in your vicinity) to the designer who can finally make your beautiful insights visible to those who matter. Get in touch with all of them and take them along, up and onwards with you. Every link in that pipeline matters.

Get your language right.

When you offer insights about/from data to someone, you operate in an abstract space of ideas. In such a scenario, you could be talking about the same thing and still not be on the same page. This is usually the case because you are quite literally using words that mean something different to the other party or using different meanings of the same word.

Again, the Chinese whispers that happen between the sales team, client management, and company management mean that by the time the data and the problem reach the engineering team, several things have been lost in translation. In academia, when researchers from different backgrounds or fields collaborate, there are several different ways of expressing the same ideas, creating potential space for misunderstanding.

For example, when you talk to a natural language processing researcher, the audio and visual information is ‘context’ or background, i.e. all the extra information that they may or may not want to include in their machine learning models. Similarly, when you talk to a speech technology researcher, language and visual cues are ‘context’. As someone who has worked on building machine learning models to combine all these communication media, I very quickly realized that while discussing or presenting my work, I would have to avoid the word ‘context’ like the plague if I wanted to get my point across to everyone together.

Listen. Listen. And then listen some more.

To get the previous point right, you must listen to your audience’s language in the first place. But there is a bigger reason to keep one’s ears wide open. In my short stint so far, I usually find that people don’t actually know the questions they want answers to. However, people do know very well the pain points of the problems they are facing. Try to listen to these in detail and with great patience. Even to the egoistic (a*holes) who might come to you full of insight and in no mood to listen to what the data is actually saying.

This does two important things for you. First, you can decide whether a person really has a problem or has already decided what they want to hear. Next, if they do have a problem, it is better for everyone if you can quickly assess the situation and come up with the questions that can be answered and then the questions that need to be answered. Many times, they are not the same. If you are ever part of a project in which these happen to be the same, jump at it with all you’ve got. This does not happen often.

This is analogous to the famous Steve Jobs mantra that consumers do not know what they want. So you must get good at listening, because before finding a solution, you must transform a problem into a good question.

Math is beautiful.

When I had just started out, I was brash and naïve. I wanted to apply the full force of my training and change things every day. But there was a slow and painful process of realizing that a) the world does not know its problems, b) it is not always looking for solutions to the known problems, and c) so many years of my education had not quite prepared me to face the factors that were important in solving problems.

I immediately changed gears to try to fix some of those issues in my life. The world is hard and disappointing. However, Math is only hard, not disappointing. As a data scientist I place my bets on Math, and it is that bet that inspires me to get up every day and get to work, no matter what.

I hope this was useful! Here’s wishing a happy productive 2017 to all!