Aaksha Meghawat, Steven Moore, Qinlan Shen, Steven Dang, Carolyn P. Rose
Prologue
Who decides what is right? Who decides what is a ‘best practice’? The way open source developers, teams and software organizations deal with these questions is crucial to their success and the community as a whole.
The research documented in this post was carried out by me and a few of my amazing lab mates in 2016. With the rise of remote work, and with most work and design conversations (negotiations, decisions) moving to Slack, Zoom, and similar online channels, I have a newfound appreciation for the insights we gleaned back in 2016.
The general opinion seems to be that, beyond likes and retweets, measures of the quality of a discussion, such as openness, topic diversity, and framing styles, cannot be measured objectively in any useful way. We challenged this assumption in our work and came across some interesting insights into developer productivity. Using our measures of conversation quality, we were able to improve the F1 score of predicting acceptance/rejection of a pull request by 66%. For two other key takeaways, you can jump directly to the Results & Key Takeaways section.
The ultimate vision of this work was to translate our conversational quality measures into tools and interventions that would help developers, and teams of all kinds, have more productive conversations. Unbeknownst to us, many smart people had the same idea. Our dream would be to turn this into a seamless, real-time conversation quality measure that helps people have better conversations. The closest thing to this out there is Bridgewater Associates’ Dot Collector (from 9:30 onwards in this video).
Introduction
Discourse in online work communities differs from that found in weblogs, online support groups, or news sites — the task-focused yet social nature of the community influences the nature of the conversation, as discussions are not solely focused on the ideas and issues at hand but also on the social dynamics of the group (Richards, 2006). This raises the question of what the dominant manifestations of this social influence on task discussions are. Here we propose an analysis of GitHub pull-request conversation threads, looking at how Openness and Framing are leveraged as well as how the two may interact to influence the success of a community pull-request contribution.
Looking at the data through the perspective of Openness, we wanted to see how accepting a project’s community was when discussing a pull request. We also looked at the data through a Framing lens to see how the topics discussed in a thread were indicative of its success. Combining these two lenses would hopefully provide insights into how a thread’s contributors made their ideas heard while also playing their part in the project’s community.
Research Questions
We define “success” of a conversation as acceptance/merge of a pull request. Our primary research questions then are whether the following have an impact on this “success” metric:
- Openness: How do we define & detect openness computationally? (RQ1)
- Framing Mechanisms: How do we define & detect different ways of framing, computationally? (RQ2)
- Topic Diversity & Topic Chains (Transactivity): How do we capture the diversity and continuity of topics in a thread, computationally? (RQ3)
Data Description
We analyzed pull request conversations from a subset of GitHub projects for Ruby. The data consists of 1862 projects that contain at least one pull request. From these projects, we picked out threads that have at least three different contributors.
To get started on RQ1 & RQ2, we picked out threads for qualitative analysis, sampling from both highly active and less active communities. To do this, we stratified the communities according to their total number of pull requests and sampled two projects above and two below the median number of pull requests. For each of these communities, we sampled 3 pull request threads. The threads in the selected communities were divided into three groups based on the number of comments in each thread (low, medium, and high number of comments).
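For concreteness, the stratified sampling step might look something like the following pandas sketch. The DataFrame, column names, and example projects here are hypothetical stand-ins, not our actual data pipeline.

```python
import pandas as pd

# Hypothetical project-level summary: one row per project with its pull request count.
projects = pd.DataFrame({
    "project": ["org/project-a", "org/project-b", "org/project-c", "org/project-d"],
    "num_pull_requests": [12, 340, 5, 90],
})

median_prs = projects["num_pull_requests"].median()

# Split communities at the median, then sample two projects from each stratum.
above = projects[projects["num_pull_requests"] > median_prs].sample(2, random_state=0)
below = projects[projects["num_pull_requests"] <= median_prs].sample(2, random_state=0)
sampled_projects = pd.concat([above, below])
print(sampled_projects)
```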
Measuring Openness (RQ1)
In our work, Openness/Expansiveness is the property of a statement that creates more dialogic space for another participant’s expression or opinion. This is often operationalized through a question or an explicit request for participation from other people, but it may also be demonstrated by expressing alternative views. We term a statement that has the opposite effect Contractive. (Note that the usual positive connotation attached to ‘Openness’ should not be assumed here, nor the negative connotation of ‘Contractive’: too many expansive statements can also lead to confusion or slow decision making, resulting in a pull request failing to be merged, as we show later in examples.)
Following qualitative analysis, we compiled a list of expansive and contractive patterns from the common patterns found. These initial patterns were fed into the model, and we then checked whether they appropriately tagged units as expansive or contractive. After a small sample was both qualitatively & quantitatively coded for expansive/contractive patterns and potentially interesting markers, we proceeded with affinity diagramming. During the affinity diagramming process, markers that did not fit directly into expansive or contractive, such as a particular acronym or ‘+1’, were grouped into various themes. After fitting them into their final groups, we counted the context of the units in which these markers were used, which gave us the proportion of expansive, contractive, and neutral units each marker fell into. Markers that appeared in either expansive or contractive units 75% of the time or more were used as patterns for the respective classification. For example, the acronym ‘imo’ was used as a contractive pattern and the use of ‘?’ was used as an expansive pattern.
| Openness | % Sample | % Computational Tagging |
| --- | --- | --- |
| Expansive | 20.2 | 16.36 |
| Contractive | 3.78 | 2.67 |
Table 1: Percentage of units tagged as expansive or contractive (sample vs. computationally tagged dataset)
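As a rough illustration of the resulting tagger, here is a minimal Python sketch of pattern-based openness tagging. Only the ‘?’ and ‘imo’ markers come from the text above; the other patterns and the precedence between categories are illustrative assumptions.

```python
import re

# Illustrative pattern lists; the real marker sets from affinity diagramming were larger,
# and only '?' (expansive) and 'imo' (contractive) are taken from the examples above.
EXPANSIVE_PATTERNS = [r"\?", r"\bwhat do you think\b", r"\bany thoughts\b"]
CONTRACTIVE_PATTERNS = [r"\bimo\b", r"\bobviously\b"]

def tag_openness(unit: str) -> str:
    """Tag a conversation unit as expansive, contractive, or neutral.
    Precedence (expansive checked first) is an assumption for this sketch."""
    text = unit.lower()
    if any(re.search(p, text) for p in EXPANSIVE_PATTERNS):
        return "expansive"
    if any(re.search(p, text) for p in CONTRACTIVE_PATTERNS):
        return "contractive"
    return "neutral"

print(tag_openness("Could we handle the narrow-terminal case differently?"))  # expansive
print(tag_openness("imo this belongs in a separate gem"))                     # contractive
```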
Characterizing Framing (RQ2)
From our qualitative analysis, we distilled five framing types that are key to developer conversations on GitHub:
| Code | Definition | Example |
| --- | --- | --- |
| Establishing Statement | Introducing or supporting the factual/valid nature of a proposition | It’s very difficult to reproduce error because it depends on terminal width. |
| Alignment Statement | Stating agreement/disagreement on a specific discussion point, including statements of appreciation | Good idea @jferris. “+1” |
| Contrasting Statement | Discussing demonstrated or proposed attributes in contrast to alternatives | I would phase it out slowly, i.e. have has_plugin? recognize both current name and the gem name initially, then slowly deprecate it (with a warning possibly if the old name way is used). |
| Alignment Request | Requesting a statement of stance on a specific discussion point | Could you please have a look there and comment? |
| Clarification Request | Requesting restatement/elaboration/clarification of a specific discussion point | If we leave this in: Is this something that needs to be generated during clearance install, if the users table exists? |
Table 2: Framing codes with definitions and examples
Qualitative Analysis Methodology
Qualitative analysis began with reviewing 3 conversations, annotating conversation fragments for two primary characteristics: framing-related discourse moves and linguistic indicators of the described framing moves. From this annotation library, an initial coding dictionary was formed by synthesizing the annotations into groups and forming operational definitions of each group. These codes were carried forward into the analysis of an additional 3 conversations in order to evaluate their robustness (ability to differentiate between statements) and coverage (ability to categorize all statements in the data). New framing moves were annotated and new synthesized definitions were formed from the revised groups. This analysis was repeated for the remaining 2 groups of 3 conversations until a final codebook was formed, as shown in Table 2 above.
Definitions of Framing Codes
The resulting codebook identified two broad classes of discourse moves related to Framing: Statements and Requests. Statements are sentences that introduce information into the discourse, while Requests solicit information, factual or opinion-based, from others.
Establishing statements are the foundation of most discourse: they cover all discourse moves that introduce information with the purpose of establishing its factual nature. From a framing perspective, these statements are important to identify because they let us determine which statements introduce factual information, and which of those are debated versus ignored or accepted.
Alignment statements are specifically discourse moves which introduce an individual’s agreement or disagreement with a particular fact or discussion point. These statements capture categories of discourse moves intended to introduce individual opinion into the conversation.
Contrasting statements are defined as discourse moves that introduce information intended for comparison or contrasting with previously stated or assumed information. These statements specifically contain multiple perspectives and relational information between each perspective.
Alignment Requests are defined as statements explicitly soliciting the opinions of others. Requests differ from statements in that, while they too can introduce information, the primary goal of the discourse move is to solicit the opinions of others regarding the stated information.
Clarification Requests are defined as statements explicitly soliciting additional information from others usually with respect to previously introduced or assumed information. These requests may also introduce information, but with the intent of supporting the specifics of the request for additional information.
Key Observations: Usually a contribution is elaborated early in the conversation and one or two specific points are discussed for the remainder of the thread. In the sample analyzed, two threads from the same community introduced comparable features (lazy loading), but each thread presented the value of the idea from a different perspective; the result was that one thread was merged and the other rejected. On the other hand, many of the rejected threads followed a pattern of a short elaborated description followed immediately by at least one contrasting statement, which was then followed by alignment statements. This indicates that capturing patterns of these base categories of statements could be valuable in understanding the conversational dynamics.
Similar to Openness, we extracted patterns for our framing categories and then tagged each sentence in each post as belonging to one of the five framing classes. Using the sentence tokenizer in NLTK, we split each post into individual sentences. For each sentence, we counted how many patterns from each framing category matched the sentence and took the class with the most pattern hits as the framing class of the sentence. Sentences that do not exhibit any of the patterns are assigned to the Establishing Statement class. For each post, we then maintained the percentage of sentences belonging to each category.
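A minimal sketch of this framing tagger is below. Apart from the use of NLTK’s sentence tokenizer, the “most hits wins” rule, and the Establishing Statement fallback, the specific patterns listed are invented for illustration.

```python
import re
from collections import Counter
import nltk  # nltk.download('punkt') may be required on first use

# Illustrative (not exhaustive) regex pattern lists for each framing category.
FRAMING_PATTERNS = {
    "alignment_statement":   [r"\+1\b", r"\bagree\b", r"\bgood idea\b"],
    "contrasting_statement": [r"\binstead\b", r"\brather than\b", r"\balternatively\b"],
    "alignment_request":     [r"\bcould you .*comment\b", r"\bwhat do you think\b"],
    "clarification_request": [r"\bcan you explain\b", r"\bwhat does .* mean\b"],
}

def tag_framing(post: str) -> dict:
    """Return the fraction of sentences in a post assigned to each framing class."""
    sentences = nltk.sent_tokenize(post)
    counts = Counter()
    for sent in sentences:
        hits = {cls: sum(bool(re.search(p, sent.lower())) for p in pats)
                for cls, pats in FRAMING_PATTERNS.items()}
        best_cls, best_hits = max(hits.items(), key=lambda kv: kv[1])
        # Sentences matching no pattern default to the Establishing Statement class.
        counts[best_cls if best_hits > 0 else "establishing_statement"] += 1
    total = len(sentences) or 1
    return {cls: counts[cls] / total
            for cls in list(FRAMING_PATTERNS) + ["establishing_statement"]}
```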
Transactive Chains (RQ3)
A major aspect of framing is the control of what topics are discussed in a thread. To operationalize this aspect of framing for our model, we introduced the concept of transactivity chains, a time-dependent chain of comments in a thread that are similar in what they discuss. The rationale behind these transactivity chains is that they represent different “flows” in the thread about what is being discussed — one transactivity chain represents the entire lifespan of a certain conversation subject. By examining the different transactivity chains across a thread, we can gain insight into what kind of conversational structures may be indicative of successful collaboration.
We define transactivity chains as time-ordered chains of comments in which connected posts have some semantic similarity. To measure the semantic similarity between posts, we run LDA (Blei et al., 2003) over our entire dataset, using 50 topics and treating individual comments as documents. This generates a topic model over our dataset and assigns each post a distribution over the generated topics. We then consider two posts to be semantically similar if the cosine distance between their topic distributions is less than 0.5.
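The sketch below shows one way to compute these similarities, using scikit-learn’s LDA implementation and SciPy’s cosine distance. The example comments are made up, and we are not claiming this is the exact library setup used in our original pipeline.

```python
from scipy.spatial.distance import cosine
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# In practice this list would hold every comment in the dataset, one string per comment.
comments = [
    "This breaks when the terminal width is too small.",
    "Could we lazy load the plugin list instead?",
    "Lazy loading sounds fine, but deprecate the old name slowly.",
]

# 50-topic LDA over all comments, treating each comment as a document.
vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(comments)
lda = LatentDirichletAllocation(n_components=50, random_state=0)
topic_dists = lda.fit_transform(counts)  # one topic distribution per comment

def semantically_similar(i: int, j: int, threshold: float = 0.5) -> bool:
    """Two comments are linked in a transactivity chain if the cosine distance
    between their topic distributions is below the threshold (0.5 in the text)."""
    return cosine(topic_dists[i], topic_dists[j]) < threshold
```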
We provide a simple example to demonstrate how we represent the transactivity chains as features for our model. Consider Figure 1, where each numbered node represents a comment (time-ordered) in a thread and edges represent that two nodes are part of the same transactive chain. The thread represented by the figure consists of 6 posts that make up two transactivity chains.
We encode each transactivity chain as 3 features: the number of comments in the chain (#Comments), the number of comments the chain spans from its first to its last comment (ΔTime), and the normalized position of the final comment in the chain relative to the end of the thread (LastTimeSlot). This gives us the following set of features for each of the transactivity chains in Figure 1:
| ID | #Comments | ΔTime | LastTimeSlot |
| --- | --- | --- | --- |
| 1 | 3 | 6 | 6/6 |
| 2 | 3 | 3 | 5/6 |
To summarize these features at the thread level, we use the median, mean, min, and max of the transactivity chain features. We also use the median, mean, min, and max, across the transactivity chains of a thread, of the percentage and total number of statements tagged with each openness and framing category as additional transactivity chain features.
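The per-chain features and the thread-level aggregation can be sketched as follows. The comment positions used for the Figure 1 example are assumed values chosen to be consistent with the table above.

```python
import statistics

def chain_features(chain_positions: list[int], thread_length: int) -> dict:
    """Features for one transactivity chain, given the time-ordered, 1-indexed
    positions of its comments within a thread of thread_length comments."""
    return {
        "num_comments": len(chain_positions),
        # Number of comments spanned from the chain's first to its last comment.
        "delta_time": chain_positions[-1] - chain_positions[0] + 1,
        # Normalized position of the chain's final comment within the thread.
        "last_time_slot": chain_positions[-1] / thread_length,
    }

def thread_features(chains: list[list[int]], thread_length: int) -> dict:
    """Summarize per-chain features at the thread level (median/mean/min/max)."""
    per_chain = [chain_features(c, thread_length) for c in chains]
    summary = {}
    for name in ("num_comments", "delta_time", "last_time_slot"):
        values = [f[name] for f in per_chain]
        summary[f"{name}_median"] = statistics.median(values)
        summary[f"{name}_mean"] = statistics.mean(values)
        summary[f"{name}_min"] = min(values)
        summary[f"{name}_max"] = max(values)
    return summary

# Assumed comment positions consistent with the table above:
# chain 1 links comments 1, 4, and 6; chain 2 links comments 3, 4, and 5.
print(chain_features([1, 4, 6], thread_length=6))   # 3 comments, ΔTime 6, LastTimeSlot 1.0
print(thread_features([[1, 4, 6], [3, 4, 5]], thread_length=6))
```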
Results & Key Takeaways
The surprising relationship between Openness & Transactivity

In Graph 1 we see that points with greater than ~50 expansive units are too sparse to demonstrate a trend. We intuitively expected that the more expansive conversation units a conversation contained, the more topics (and transactive chains) would be found in the thread. However, as seen in Graph 2, the general tendency to the left of the red line is that an increase in expansiveness does not necessarily increase the number of transactive chains. On looking at some of the conversations, our intuition was that increasing expansiveness created greater dialogic space that encouraged discussion of the current topic, in fact encouraging continuity of a transactive chain and potentially greater consensus building.
This intuition is also reflected in Graph 3 where the length of the longest transactive chain increases with increasing presence of expansiveness.

Conversely, we expected threads higher in Contrastiveness to have a lower count of transactive chains, as the posts would tend to be monoglossic and dead-ended. However, we found that in the threads shown in Table 3, commenters reference other users in the project who have not yet commented on the thread. This brings those users into the conversation, where they then contribute to the discussion and generate new transactive chains. It also appears to add to the contrastiveness of the conversation, as it is often the case that these users come into the conversation and immediately provide their opinion.
| Example Thread | # Transactive Chains | Contrastive %age |
| --- | --- | --- |
| https://github.com/jekyll/jekyll/pull/1682 | 5 | 25 |
| https://github.com/mitchellh/vagrant/pull/1592 | 5 | 21.43 |
| https://github.com/resque/resque/pull/772 | 6 | 21.05 |
| https://github.com/vcr/vcr/pull/404 | 5 | 21.48 |
Table 3: Example threads with their number of transactive chains and contrastive percentage
Uncovering a best practice to increase developer productivity
An interesting finding we came across was the interaction between one of our framing categories, clarification requests, and our measure of openness. We found that in threads with a high percentage of clarification requests and an equally high percentage of expansive units, developers were following a common pattern to make their voices heard: they always began by referencing a particular portion of the commit with a code block and then proceeded to ask a question about it. By following such a pattern, the commenters were able to have their questions and opinions addressed immediately.
This insight could be used directly to design a feature in GitHub: for example, every time a developer has a question, they could be prompted to reference the particular block of code relevant to the question, encouraging a more useful and productive discussion.
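A crude heuristic for spotting this “quote a code block, then ask” pattern in comment bodies might look like the following. The function name, the reliance on markdown code fences, and the trailing question mark check are simplifying assumptions for illustration.

```python
import re

CODE_FENCE = "`" * 3  # GitHub-flavored markdown code fence

def quotes_code_then_asks(comment_body: str) -> bool:
    """Rough heuristic: the comment contains a fenced code block (quoting part of
    the commit) and then asks a question about it."""
    fenced = re.search(f"{CODE_FENCE}.+?{CODE_FENCE}", comment_body, flags=re.DOTALL)
    if fenced is None:
        return False
    return "?" in comment_body[fenced.end():]
```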
Prediction of Pull Request Acceptance (Primary Research Question)
The final experiment that we ran was to use our features to train a classifier to predict whether a pull request would be accepted or rejected. We train a logistic regression classifier and use 5-fold cross validation on our fully tagged dataset (which includes our openness tags, framing category tags, and transactivity chain features) to compare the performance of four different feature configurations:
- Openness: % and total number of each openness category
- Framing: % and total number of each framing category
- Openness + Framing: combination of Openness and Framing features
- Openness + Framing + TC: Openness, Framing, and Transactivity Chain Features (described in the Transactivity Chains section)
We compare these models to a baseline model, which predicts the majority class (Accepted).
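A minimal sketch of this experimental setup with scikit-learn is shown below. The random feature matrix stands in for our real thread-level features, and the macro-averaged scoring is one plausible choice rather than a confirmed detail of our evaluation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

# X: one row per pull request thread, holding openness, framing, and transactivity-chain
# features; y: 1 if the pull request was merged, 0 if rejected. Random data stands in
# for the real feature matrix in this sketch.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 40))
y = rng.integers(0, 2, size=500)

clf = LogisticRegression(max_iter=1000)
scores = cross_validate(
    clf, X, y, cv=5,
    scoring=("accuracy", "precision_macro", "recall_macro", "f1_macro"),
)
for metric in ("test_accuracy", "test_precision_macro", "test_recall_macro", "test_f1_macro"):
    print(metric, scores[metric].mean())
```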
| Model | Accuracy | Precision | Recall | F1 |
| --- | --- | --- | --- | --- |
| Baseline | 53.80 | 26.90 | 50.00 | 34.98 |
| Openness | 53.87 | 52.45 | 51.50 | 47.41 |
| Framing | 59.24 | 59.25 | 57.72 | 56.67 |
| Openness + Framing | 59.31 | 59.14 | 57.97 | 57.24 |
| Openness + Framing + TC | 59.55 | 59.24 | 58.44 | 58.04 |
We improve the F1 score over the baseline by 66% (from 34.98 to 58.04) using our measures of openness, framing styles, and topic diversity/transactive chains. This indicates that the features we developed encapsulate meaningful quality measures of developer discussions and can be used to gauge the potential productivity of a group of open source developers.
Data Driven Case Study
As an example, we examined the outlier thread (circled in red in Graph 1). Using our metrics, we were able to neatly summarize the negotiation that happened in that conversation.

Works Cited
- Argamon, S., Whitelaw, C., Chase, P., Hota, S. R., Garg, N., & Levitan, S. (2007). Stylistic text classification using functional lexical features. Journal of the American Society for Information Science and Technology, 58(6), 802-822.
- Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3, 993-1022.
- Chatterjee, M. (2007). Textual engagements of a different kind? In Proceedings from Australian Systemic Functional Linguistics Association Congress.
- Chong, D., & Druckman, J. N. (2007). Framing theory. Annual Review of Political Science, 10, 103-126.
- Cosentino, V., Cánovas Izquierdo, J. L., & Cabot, J. (2014). Three metrics to explore the openness of GitHub projects. arXiv preprint arXiv:1409.4253.
- Howley, I., Mayfield, E., & Rosé, C. P. (2012). Linguistic analysis methods for studying small groups. The International Handbook of Collaborative Learning.
- Gee, J. (2011). An Introduction to Discourse Analysis: Theory and Method, Routledge.
- Jo, Y., & Oh, A. H. (2011, February). Aspect and sentiment unification model for online review analysis. In Proceedings of the fourth ACM international conference on Web search and data mining (pp. 815-824). ACM.
- Martin, J. & White, P. (2005). The Language of Evaluation: Appraisal in English, Palgrave
- Nguyen, V. A., Boyd-Graber, J., Resnik, P., Cai, D. A., Midberry, J. E., & Wang, Y. (2014). Modeling topic control to detect influence in conversations using nonparametric topic models. Machine Learning, 95(3), 381-421.
- Read, J., & Carroll, J. (2012). Annotating expressions of appraisal in English. Language Resources and Evaluation, 46(3), 421-447.
- Richards, K. (2006). Language and Professional Identity: Aspects of Collaborative Interaction, Palgrave, Chapter 3
- Taboada, M., & Grieve, J. (2004, March). Analyzing appraisal automatically. In Proceedings of AAAI Spring Symposium on Exploring Attitude and Affect in Text (pp. 158-161). AAAI Press.
- Tsay, J., Dabbish, L., & Herbsleb, J. (2014). Let’s talk about it: evaluating contributions through discussion in GitHub. In Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering (pp. 144-154). ACM.
This research work was supported by an NSF “BIGDATA” grant. Code, Data and Features can be provided on request.