In 2016, I started out with the goal of using natural language processing & discourse analysis to improve developer productivity on Github. Thanks to early success in our computational approaches, we were soon asking the question of whether we could predict if an open source project was on the path to failure or success. This was further motivated by the fact that a developer’s time is limited and therefore in Open Source communities, they need to make important decisions regarding which projects to depend upon as resources and which projects to contribute towards.
Our qualitative analysis to develop computational measures for conversational quality revealed that projects went through different phases of activity (Fig. 1). First, to help in characterization of the ‘activity’ phases, we collected several important activity parameters such as commits, pull request merges, pushes, issue comments, issues opened, issues closed, commit comments, pull request rejections and issues remaining open totaled per month of a project’s timeline on Github.
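As a rough illustration of what "totaled per month" can look like in practice, here is a minimal sketch that rolls a raw event log up into monthly counts. The event names, the tuple layout and the `monthly_activity` helper are assumptions made for this example, not the pipeline we actually ran.

```python
from collections import Counter, defaultdict
from datetime import datetime

# Hypothetical event log for one project: (ISO timestamp, event type).
events = [
    ("2016-01-03T10:00:00", "commit"),
    ("2016-01-17T12:30:00", "issue_opened"),
    ("2016-01-20T09:15:00", "commit"),
    ("2016-02-02T08:00:00", "pull_request_merged"),
    ("2016-02-11T16:45:00", "issue_comment"),
]

def monthly_activity(events):
    """Return {"YYYY-MM": Counter of event types} for one project."""
    totals = defaultdict(Counter)
    for timestamp, kind in events:
        month = datetime.fromisoformat(timestamp).strftime("%Y-%m")
        totals[month][kind] += 1
    return dict(totals)

activity = monthly_activity(events)
for month in sorted(activity):
    print(month, dict(activity[month]))
```

Each month then becomes one observation vector (one count per activity parameter) in the project's timeline.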
Next, we needed a model that would be able to detect these phases of a project assigning a clear phase identifier for the months of activity of the projects. This identification would make the wide variety of projects comparable at different points in time. For example, this would help in comparing the coordination style of a new small-sized project in a phase with an older project’s earlier days when it had been a small-sized project and was in the same activity phase.
“One can think of this as a health checkup of the open source project. The activity indicators are like the vital stats of the project. This combined with its history when compared to life histories of other successful or unsuccessful open source projects can give some early indication of its potential “health” problems and survival rate”.
We hoped that enabling such comparisons would drive insight as to the quality of coordination that may have helped a project in surviving long enough to attract more resources and become a big project. A clear interpretation of these states in terms of the parameters collected would help in evaluating the soundness of the model’s states. Further, our understanding of how the phases progressed and what kind of coordination and conversation would result in a project reaching favorable or unfavorable stages in their life cycles would improve. For this reason, we decided to apply models that would infer discrete states for the project timelines.
The other important characteristic of the project timelines that we discovered in our preliminary qualitative analysis was that developers worked in bursts of activity. This clearly meant that the activity (and phase) in one month was not un-correlated with the previous month’s activity in the sequence of months that formed the timeline of a project. We needed a model that would consider the temporal dependence of the activities reflecting the fact that some activity in one time unit increases the probability of seeing more activity in the next time unit. At the minimum, we needed a model that would relax the i.i.d. assumption for its data points (time units of months in our case).
We conceptualized that the discrete state would be an indicator of the phase of the project’s lifecycle which would be expressed via the activity visible on the platform. The project’s current state would determine the state that it would occupy in the next time unit. Since the phase of a project was not visible and was not specified directly on Github, our statistical model would have to infer the hidden state from the visible activity snapshots and learn a probabilistic distribution over these activity variables to better fit and account for the variety of activity that happens on the platform.
The Hidden Markov Model (HMM) fit all these requirements neatly. The HMM is a statistical Markov model in which the system being modeled is assumed to be a Markov process with unobserved (i.e. hidden) states. The Markov property of the model captured the intuition that the present state determined the future state of a project. Additionally, we chose to model the probabilistic relation between the discrete states and the activity on the platform via a multivariate Gaussian distribution. This way we could leverage the flexibility of a probabilistic fit on the observed activity to acquire a better model. At the same time we could take advantage of the interpretability of the Gaussian distribution by understanding every discrete state that the model inferred in terms of the mean value of the observed activity parameters.
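To make the idea of inferring a hidden phase from visible activity concrete, here is a minimal sketch of Viterbi decoding for an HMM with Gaussian emissions. Everything numeric in it is an assumption for illustration: two states instead of the number we fitted, hand-set start and transition probabilities, a diagonal covariance, and two activity parameters (commits and issue comments per month). It shows decoding with given parameters only, not how the parameters were trained.

```python
import math

def log_gauss_diag(x, mean, var):
    """Log density of a Gaussian with diagonal covariance."""
    return sum(
        -0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )

def viterbi(obs, log_start, log_trans, means, variances):
    """Most likely hidden state sequence for a list of observation vectors."""
    n_states = len(log_start)
    # score[s] = best log probability of any path ending in state s
    score = [log_start[s] + log_gauss_diag(obs[0], means[s], variances[s])
             for s in range(n_states)]
    back = []
    for x in obs[1:]:
        prev, score, pointers = score, [], []
        for s in range(n_states):
            best = max(range(n_states), key=lambda r: prev[r] + log_trans[r][s])
            pointers.append(best)
            score.append(prev[best] + log_trans[best][s]
                         + log_gauss_diag(x, means[s], variances[s]))
        back.append(pointers)
    # Trace back from the best final state
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]

# Hypothetical parameters, chosen by hand for the example
states = ["dormant", "active"]
log_start = [math.log(0.8), math.log(0.2)]
log_trans = [[math.log(0.9), math.log(0.1)],
             [math.log(0.3), math.log(0.7)]]
means = [[0.5, 0.2], [20.0, 10.0]]        # mean (commits, issue comments)
variances = [[1.0, 1.0], [50.0, 25.0]]

# Five months of (commits, issue comments) for one project
months = [[0, 0], [1, 0], [18, 12], [25, 9], [0, 1]]
path = viterbi(months, log_start, log_trans, means, variances)
print([states[s] for s in path])
```

The state means are what make the result interpretable: a state whose mean vector is close to zero reads as a dormant phase, and one with high commit and comment means reads as an active phase.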
We attempted to strike a fine balance between having samples that were homogeneous in terms of the type of project and diverse enough in terms of work coordination. For this reason we chose a single domain of development, PyPI projects, and began with a large set of projects within this domain on GitHub. We began with a list of 48,668 GitHub repositories mentioned in PyPI metadata records. From these we filtered for projects having at least 10 stars or 3 contributors. We took stars and contributors as an approximation of the community’s interest in the project, leaving us with 16,682 projects.
There were discrepancies in many projects, either because their names had changed or because data was missing or deleted from one data source or another. As we were looking at fine-grained interactions over short timespans, we wanted a particularly clean set of projects. We finally had 13,479 PyPI projects in our set.
We trained a multivariate Gaussian hidden Markov model (HMM) on these activity parameters. For the month-level HMM, 12 discrete states provided the best fit on the activity timelines of a held-out set of projects. We then included all the months that were labeled with discrete states of the Gaussian HMM which showed significant levels of activity. We focused our analysis on 6 states with significant transition probabilities (Figure 2). After filtering for the active months, only 4,934 of the 13,479 projects that we started with remained in our set, as only these had active months.
The transition probabilities of the HMM states confirmed most of our qualitative conclusions of the activity patterns of projects. Any project was most likely to start and stay in the dormant state. Furthermore a project’s probability to maintain its current state, whether dormant or active was higher than transitioning to a different state. The trend of staying in the dormant state created the issue of data sparsity. The trend of staying in the active state created the phenomenon of bursts. Both these intuitions are visible in the HMM model.
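One way to read a high self-transition probability: in a Markov chain, the number of consecutive time units spent in a state is geometrically distributed, so a state with self-transition probability p lasts 1 / (1 - p) time units on average. The sketch below applies that to made-up probabilities; the values are hypothetical and are not the ones our model learned.

```python
def expected_dwell_months(self_transition_prob):
    """Mean number of consecutive months spent in a state before leaving it."""
    return 1.0 / (1.0 - self_transition_prob)

# Hypothetical self-transition probabilities, not the fitted values
self_transitions = {"dormant": 0.9, "active": 0.7}
dwell = {state: expected_dwell_months(p) for state, p in self_transitions.items()}

for state, months in dwell.items():
    print(f"{state}: about {months:.1f} months per visit")
```

Under these assumed numbers a dormant spell lasts far longer than an active one, which is the sparsity-plus-bursts picture described above.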
It may be argued that one could set a low threshold on the activity parameters to capture most activity. However, in our qualitative study we found that low activity which preceded or followed highly active states was also crucial in determining the coordination quality during that burst. Not all low activity is important, however, and the rest can be ignored to avoid noise. A Markov model can automatically take care of such subtleties.
This HMM lifecycle framework has already been used as a filtering mechanism to focus on different phases of a project by other researchers. This helps them zero in on the phases of a project they are most interested in to observe or measure other metrics relevant to their problem statement.
Potential future work also includes gaining a better understanding of the kinds of events/factors that cause the project to transition from one state to another, eventually estimating their survival rate.
Who decides what is right? Who decides what is a ‘best practice’? The way open source developers, teams and software organizations deal with these questions is crucial to their success and the community as a whole.
The research I document in this post was carried out by me and a few of my amazing lab mates in 2016. With the rise of remote work and most work/design conversations (negotiations, decisions) moving to Slack, Zoom and other such online mechanisms, I have a new found appreciation for the insights we gleaned back in 2016.
General opinion seems to be that (other than likes, retweets) measures of the quality of a discussion such as openness, topic diversity, framing styles etc. cannot objectively be measured in any useful way. We challenged this assumption in our work and came across some interesting insights into developer productivity. Using our designs of conversation quality, we were able to improve the F1 score of predicting acceptance/rejection of a pull request by 66%. For 2 other key takeaways, you can jump directly to key takeaways.
The ultimate vision of this work was to translate our conversational quality measures into tools and interventions that would help developers and all kinds of teams have more productive conversations. Unbeknownst to us, many smart people had the same idea. Our ultimate dream would be to make this a seamless real-time conversation quality measure to help people have better conversations. The closest thing to this (out there) is Bridgewater Associates’ Dot Collector (from 9:30 onwards in this video).
Discourse in online work communities differs from that found in weblogs, online support groups, or news sites — the task-focused yet social nature of the community influences the nature of the conversation, as discussions are not solely focused on the ideas and issues at hand but also the social dynamics of the group (Richards, 2006). This raises the question of what are the dominant manifestations of this social influence on task discussions. Here we propose an analysis of Github pull-request conversation threads, looking at both how Openness and Framing are leveraged as well as how the two may interact to influence the success of a community pull-request contribution.
Looking at the data through the perspective of Openness, we wanted to see how accepting a project’s community was when discussing a pull request. We also looked at the data through a Framing lens to see how the topics discussed in a thread were indicative of its success. Combining these two lenses would hopefully provide insights into how a thread’s contributors made their ideas heard while also playing their part in the project’s community.
We analyzed pull request conversations from a subset of GitHub projects for Ruby. The data consists of 1862 projects that contain at least one pull request. From these projects, we picked out threads that have at least three different contributors.
We define “success” of a conversation as acceptance/merge of a pull request. Our primary research questions then are whether the following have an impact on this “success” metric:
Openness: How do we define & detect openness computationally? (RQ1)
Framing Mechanisms: How do we define & detect different ways of framing, computationally? (RQ2)
We picked out threads for qualitative analysis to get started on RQ1 & RQ2. We sampled threads from both highly active and less active communities. To do this, we stratified the communities according to total number of pull requests and sampled two projects above and two below the median number of pull requests. For each of these communities, we sampled 3 pull request threads. The threads in the selected communities were divided into three groups based on the number of comments in each thread (low, medium, and high number of comments).
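The sampling procedure above can be sketched roughly as follows. The project names, thread ids and counts are invented. Splitting threads into low/medium/high by rank terciles and taking one thread from each group are assumptions, since the exact cut-offs are not recorded here.

```python
import random
import statistics

def sample_projects(pr_counts, per_side=2, seed=0):
    """pr_counts: {project: number of pull requests}.
    Pick projects strictly above and strictly below the median."""
    rng = random.Random(seed)
    median = statistics.median(pr_counts.values())
    above = sorted(p for p, n in pr_counts.items() if n > median)
    below = sorted(p for p, n in pr_counts.items() if n < median)
    return rng.sample(above, per_side) + rng.sample(below, per_side)

def sample_threads(comment_counts, seed=0):
    """comment_counts: {thread_id: number of comments}.
    Return one thread from each of the low, medium and high groups."""
    rng = random.Random(seed)
    ordered = sorted(comment_counts, key=comment_counts.get)
    third = len(ordered) // 3
    groups = [ordered[:third], ordered[third:2 * third], ordered[2 * third:]]
    return [rng.choice(group) for group in groups]

# Hypothetical data
pr_counts = {"a": 5, "b": 12, "c": 40, "d": 90, "e": 150, "f": 300}
comment_counts = {"t1": 2, "t2": 3, "t3": 7, "t4": 9, "t5": 20, "t6": 31}

projects = sample_projects(pr_counts)
threads = sample_threads(comment_counts)
print(projects, threads)
```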
Measuring Openness (RQ1)
Openness/Expansiveness in our work is defined as a statement which creates more dialogic space for another participant’s expression or opinion. This is often operationalized through a question or asking for explicit participation from other people. However it may also be demonstrated via expressing alternative views. We term a statement that has the opposite effect as Contractive. (Note that the usual positive connotation attached to ‘Openness’ should not be done so here. Vice versa for ‘Contractive’ statements. This is because too many expansive statements can also lead to confusion or slowness in decision making, resulting in failure of merging in pull requests, as we will show later in examples).
Following qualitative analysis, we created a list of expansive and contractive patterns from the common patterns we found. These initial patterns were input into the model, and we then checked whether they were appropriately tagging units as expansive or contractive. After a small sample was both qualitatively and quantitatively coded for expansive/contractive patterns with potentially interesting markers, we proceeded with affinity diagramming. During the affinity diagramming process, the markers that didn’t directly fit into expansive or contractive, such as a particular acronym or ‘+1’, were grouped into various themes. After fitting them into their final groups, we counted the contexts of the units in which these markers were used. This gave us, for each marker, the ratio of expansive, contractive, and neutral units it fell into. Markers that were found in expansive units, or in contractive units, 75% or more of the time were used as patterns for the respective classification. For example, the acronym ‘imo’ was used as a contractive pattern and ‘?’ was used as an expansive pattern.
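The 75% rule can be expressed in a few lines. This is a sketch under stated assumptions: the per-marker label counts below are invented for illustration, and `select_patterns` is a hypothetical helper name.

```python
from collections import Counter

def select_patterns(marker_labels, threshold=0.75):
    """marker_labels: {marker: labels of the coded units containing it}.
    Keep a marker only if one non-neutral label covers >= threshold of its units."""
    patterns = {}
    for marker, labels in marker_labels.items():
        label, hits = Counter(labels).most_common(1)[0]
        if label != "neutral" and hits / len(labels) >= threshold:
            patterns[marker] = label
    return patterns

# Hypothetical counts from a hand-coded sample
marker_labels = {
    "imo": ["contractive"] * 8 + ["expansive"] + ["neutral"],
    "?": ["expansive"] * 9 + ["neutral"],
    "+1": ["contractive"] * 5 + ["expansive"] * 3 + ["neutral"] * 2,
}

patterns = select_patterns(marker_labels)
print(patterns)
```

With these made-up counts, ‘imo’ and ‘?’ clear the threshold while ‘+1’ does not, so ‘+1’ would not be used as a pattern for either class.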
Table 1: Percentage of units tagged as expansive or contractive (sample vs. computationally tagged dataset)
Characterizing Framing (RQ2)
We distilled 5 different framing types to be key for developer conversations on Github from our qualitative analysis:
Establishing Statement: Introducing or supporting the factual/valid nature of a proposition. Example: “It’s very difficult to reproduce error because it depends on terminal width.”
Alignment Statement: Stating agreement/disagreement on a specific discussion point, including statements of appreciation. Example: “Good idea @jferris. “+1””
Contrasting Statement: Discussing demonstrated or proposed attributes in contrast to alternatives. Example: “I would phase it out slowly, i.e. have has_plugin? recognize both current name and the gem name initially, then slowly deprecate it (with a warning possibly if the old name way is used).”
Alignment Request: Requesting statement of stances on a specific discussion point. Example: “Could you please have a look there and comment?”
Clarification Request: Requesting restatement/elaboration/clarification of a specific discussion point. Example: “If we leave this in: Is this something that needs to be generated during clearance install, if the users table exists?”
Table 2: Framing Code Dictionary
Qualitative Analysis Methodology
Qualitative analysis began with reviewing 3 conversations and annotating conversation fragments for two primary characteristics: framing-related discourse moves and linguistic indicators of those framing moves. From this annotation library, an initial coding dictionary was formed by synthesizing the annotations into groups and forming operational definitions of each group. These codes were carried forward into the analysis of an additional 3 conversations in order to evaluate their robustness (the ability to differentiate between statements) and coverage (the ability to categorize all statements in the data). New framing moves were annotated and new synthesized definitions were formed from the revised groups. This analysis was repeated for the remaining 2 groups of 3 conversations until a final codebook was formed, as shown in Table 2.
Definitions of Framing Codes
The resulting codebook identified two broad classes of discourse moves related to Framing: Statements and Requests. Statements are sentences that introduce information to the discourse, while Requests are statements that solicit information from others, where the information can be factual or opinion-based in nature.
Establishing statements were the foundation of most discourse. They cover all discourse moves that introduce information with the purpose of establishing its factual nature. From a framing perspective, identifying these statements matters because it shows which statements introduce factual information, and in turn which of them are debated and which are ignored and/or accepted.
Alignment statements are specifically discourse moves which introduce an individual’s agreement or disagreement with a particular fact or discussion point. These statements capture categories of discourse moves intended to introduce individual opinion into the conversation.
Contrasting statements are defined as discourse moves that introduce information intended for comparison or contrasting with previously stated or assumed information. These statements specifically contain multiple perspectives and relational information between each perspective.
Alignment Requests are defined as statements explicitly soliciting the opinions of others. Requests differ from statements in that they can also introduce information, but the primary goal of the discourse move is to introduce information with the intent of soliciting opinions of others regarding the stated information.
Clarification Requests are defined as statements explicitly soliciting additional information from others usually with respect to previously introduced or assumed information. These requests may also introduce information, but with the intent of supporting the specifics of the request for additional information.
Key Observations: Usually a contribution is elaborated early in the conversation and one or two specific points are discussed for the remainder of the thread. In the sample analyzed, two threads from the same community introduced comparable features, lazy loading, but each thread presented the value of the idea from different perspectives. The result is that one thread was merged and the other was rejected. On the other hand many of the rejected threads followed a pattern of short elaborated descriptions followed immediately by at least one contrasting statement which is then followed by alignment statements. This indicates that capturing patterns of these base categories of statements could be valuable in understanding the conversational dynamics.
Similar to Openness, we extracted patterns for our Framing categories. Then we tagged sentences in each post as belonging to one of the five framing classes. Using the sentence tokenizer in NLTK, we split each post into individual sentences. For each sentence, we then counted how many patterns fit the sentence for each of the framing categories. We took the class with the largest number of pattern hits as the framing class of the sentence. Sentences that do not exhibit any of the patterns are assigned to the Establishing Statement class. For each post, we then maintained a percentage of sentences belonging to each category.
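A minimal sketch of this tagging step is below. The regular expressions are hypothetical stand-ins, since the real pattern lists are not reproduced in this post. A naive regex splitter is used in place of NLTK's sentence tokenizer so the example runs without extra downloads. Ties between categories are broken by dictionary order, which is also an assumption.

```python
import re
from collections import Counter

# Hypothetical patterns per framing category
PATTERNS = {
    "Alignment Statement": [r"\+1", r"\bgood idea\b", r"\bi agree\b"],
    "Contrasting Statement": [r"\binstead\b", r"\brather than\b", r"\bi would\b"],
    "Alignment Request": [r"\bwhat do you think\b", r"\bcould you please\b.*\bcomment\b"],
    "Clarification Request": [r"\bis this\b.*\?", r"\bwhy\b.*\?", r"\bhow\b.*\?"],
}
DEFAULT = "Establishing Statement"

def split_sentences(post):
    """Naive splitter standing in for NLTK's sent_tokenize."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", post) if s.strip()]

def tag_sentence(sentence):
    """Return the category with the most matching patterns, or the default."""
    text = sentence.lower()
    hits = {label: sum(bool(re.search(p, text)) for p in pats)
            for label, pats in PATTERNS.items()}
    label, count = max(hits.items(), key=lambda kv: kv[1])
    return label if count > 0 else DEFAULT

def framing_percentages(post):
    """Percentage of sentences in a post that fall into each framing class."""
    labels = [tag_sentence(s) for s in split_sentences(post)]
    if not labels:
        return {}
    return {label: 100.0 * n / len(labels) for label, n in Counter(labels).items()}

post = ("Good idea @jferris. It depends on terminal width. "
        "Is this something that needs to be generated during install?")
percentages = framing_percentages(post)
print(percentages)
```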
Transactive Chains (RQ3)
A major aspect of framing is the control of what topics are discussed in a thread. To operationalize this aspect of framing for our model, we introduced the concept of transactivity chains, a time-dependent chain of comments in a thread that are similar in what they discuss. The rationale behind these transactivity chains is that they represent different “flows” in the thread about what is being discussed — one transactivity chain represents the entire lifespan of a certain conversation subject. By examining the different transactivity chains across a thread, we can gain insight into what kind of conversational structures may be indicative of successful collaboration.
We define transactivity chains as a time-ordered chain of comments where posts that are connected have some semantic similarity. To measure the semantic similarity between posts, we run LDA (Blei et al. 2003) over our entire dataset, using 50 topics and treating individual comments as documents. This generates a topic model over our dataset and assigns each post a distribution over the generated topics. We then consider two posts to be semantically similar if they have a cosine distance of less than 0.5.
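The chain construction can be sketched as follows, taking the per-comment topic distributions as given (the LDA step itself is omitted). The linking rule used here is an assumption for illustration: each comment joins the chain of the most recent earlier comment within the cosine distance threshold, and otherwise starts a new chain. The topic vectors are invented and use 3 topics instead of 50 to keep the example readable.

```python
import math

def cosine_distance(p, q):
    """1 - cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return 1.0 - dot / norm

def build_chains(topic_dists, threshold=0.5):
    """Group time-ordered comments into chains; returns lists of comment indices."""
    chain_of = {}   # comment index -> chain id
    chains = []
    for i, dist in enumerate(topic_dists):
        for j in range(i - 1, -1, -1):   # most recent earlier comment first
            if cosine_distance(dist, topic_dists[j]) < threshold:
                chain_of[i] = chain_of[j]
                chains[chain_of[i]].append(i)
                break
        else:
            # No similar earlier comment: this comment starts a new chain
            chain_of[i] = len(chains)
            chains.append([i])
    return chains

# Hypothetical topic distributions for a thread of 6 comments
topic_dists = [
    [0.80, 0.10, 0.10],
    [0.70, 0.20, 0.10],
    [0.10, 0.80, 0.10],
    [0.75, 0.15, 0.10],
    [0.10, 0.85, 0.05],
    [0.15, 0.80, 0.05],
]
chains = build_chains(topic_dists)
print(chains)
```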
We provide a simple example to demonstrate how we represent the transactivity chains as features for our model. Consider Figure 1, where each numbered node represents a comment (time-ordered) in a thread and edges represent that two nodes are part of the same transactive chain. The thread represented by the figure consists of 6 posts that make up two transactivity chains.
We encode each of the transactivity chains as 3 features: the number of comments in the transactivity chain (#Comments), the number of comments the transactivity chain spans (ΔT), and the normalized comment distance of the final comment in the chain from the end of the thread (LastTimeSlot). This gives us the following set of features for each of the transactivity chains in Figure 1:
Feature extraction for the transactive chains shown in Figure 1 (#Comments, ΔT, LastTimeSlot)
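A small sketch of how the three chain features can be computed is below. The exact formulas are assumptions: ΔT is taken as the inclusive span from the first to the last comment of the chain, and LastTimeSlot as the distance of the chain's last comment from the final comment of the thread, divided by the largest possible distance. The chains are a made-up 6-comment thread, not necessarily the one in Figure 1.

```python
def chain_features(chain, n_comments):
    """chain: sorted list of 0-based comment indices; n_comments: thread length."""
    first, last = chain[0], chain[-1]
    return {
        "n_comments": len(chain),                 # #Comments
        "delta_t": last - first + 1,              # comments spanned (inclusive)
        "last_time_slot": (n_comments - 1 - last) / max(n_comments - 1, 1),
    }

# Hypothetical thread of 6 comments split into two chains
chains = [[0, 1, 3], [2, 4, 5]]
features = [chain_features(chain, n_comments=6) for chain in chains]
for row in features:
    print(row)
```

Under these definitions a chain that ends on the last comment of the thread gets a LastTimeSlot of 0, and one that dies out early gets a value closer to 1.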
To summarize these features at the thread level, we use the median, mean, min, and max of the transactivity chain features. We also use the median, mean, min, and max values for the percentage and total number of statements tagged as each of the openness and framing features among the transactivity chains of a thread as transactivity chain features.
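Rolling chain-level features up to one row per thread is then a matter of applying the four summary statistics to each feature. A minimal sketch, with hypothetical feature names and values:

```python
import statistics

def summarize(values):
    """Median, mean, min and max of a list of numbers."""
    return {
        "median": statistics.median(values),
        "mean": statistics.mean(values),
        "min": min(values),
        "max": max(values),
    }

def thread_features(chain_rows):
    """chain_rows: list of dicts with identical keys, one dict per chain."""
    out = {}
    for name in chain_rows[0]:
        for stat, value in summarize([row[name] for row in chain_rows]).items():
            out[f"{name}_{stat}"] = value
    return out

# Hypothetical chain-level features for one thread
chain_rows = [
    {"n_comments": 3, "delta_t": 4, "last_time_slot": 0.4},
    {"n_comments": 3, "delta_t": 4, "last_time_slot": 0.0},
]
thread_row = thread_features(chain_rows)
print(thread_row)
```

The same roll-up applies to the per-chain percentages and counts of openness and framing tags.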
Results & Key Takeaways
The surprising relationship between Openness & Transactivity
In Graph 1 we see that points with greater than ~50 expansive units are too sparse to demonstrate a trend. We intuitively expected that the more expansive conversation units there were in a conversation, the more topics (and transactive chains) would be found in the thread. However, as seen in Graph 2, the general tendency to the left of the red line is that an increase in expansiveness does not necessarily increase the number of transactive chains. On looking at some of the conversations, our intuition was that increasing expansiveness created greater dialogic space that encouraged discussion about the current topic, in fact encouraging continuity of a transactive chain and potentially greater consensus building.
This intuition is also reflected in Graph 3 where the length of the longest transactive chain increases with increasing presence of expansiveness.
Conversely, we expected threads higher in contractiveness to have a lower count of transactive chains, as the posts would tend to be monoglossic and dead-end. However, we found that in the threads shown in Table 3, the thread’s commenters reference other users in the project who have not yet commented on the thread. This brings those users into the thread’s conversation, where they then contribute to the discussion and generate new transactive chains. It also appears to add to the contractiveness of the conversation, as it’s often the case that the users come into the conversation and immediately provide their opinion.
Table 3: Threads and their percentage of clarification requests and expansive units
Uncovering a best practice to increase developer productivity
An interesting finding we came across was the interaction between one of our framing categories, clarification requests, and our measure of openness. We found that in threads which had a high percentage of clarification requests and an equally high percentage of expansive units, developers were following a common pattern to make their voice heard. The developers always began by referencing a particular portion of the commit with a code block and then proceeded to ask a question about it. By following such a pattern, the commenters were able to have their question(s) and opinion immediately addressed.
This insight can be used directly to design a feature in GitHub. For example, every time a developer has a question, they can be encouraged to refer to the particular block of code relevant to the question, which should lead to a more useful and productive discussion.
Prediction of Pull Request Acceptance (Primary Research Question)
The final experiment that we ran was to use our features to train a classifier to predict whether a pull request would be accepted or rejected. We train a logistic regression classifier and use 5-fold cross validation on our fully tagged dataset (which includes our openness tags, framing category tags, and transactivity chain features) to compare the performance of four different feature configurations:
Openness: % and total number of each openness category
Framing: % and total number of each framing category
Openness + Framing: combination of Openness and Framing features
Openness + Framing + TC: Openness, Framing, and Transactivity Chain Features (described in the Transactivity Chains section)
We compare these models to a baseline model, which predicts the majority class (Accepted).
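To show the shape of this evaluation, here is a sketch of k-fold cross-validation with an F1 score and a majority-class baseline, using only the standard library. The labels are invented, the folds are interleaved rather than shuffled, and F1 is computed on the "accepted" class. These are all assumptions for illustration. The logistic regression model itself is not reproduced here; any function with the same `fit(X_train, y_train) -> predict` shape could be plugged in.

```python
from collections import Counter

def f1_score(y_true, y_pred, positive=1):
    """F1 of the positive class (1 = accepted in this example)."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def k_fold_indices(n, k=5):
    """Yield (train, test) index lists; fold i tests every k-th item from i."""
    for i in range(k):
        test = list(range(i, n, k))
        train = [j for j in range(n) if j % k != i]
        yield train, test

def fit_majority(X_train, y_train):
    """Baseline: always predict the most common training label."""
    majority = Counter(y_train).most_common(1)[0][0]
    return lambda features: majority

def cross_validate(X, y, fit, k=5):
    """Mean F1 over k folds for a model produced by fit(X_train, y_train)."""
    scores = []
    for train, test in k_fold_indices(len(y), k):
        predict = fit([X[i] for i in train], [y[i] for i in train])
        preds = [predict(X[i]) for i in test]
        scores.append(f1_score([y[i] for i in test], preds))
    return sum(scores) / len(scores)

# Hypothetical labels (1 = accepted, 0 = rejected); features are placeholders
y = [1, 1, 1, 0, 1, 1, 0, 1, 1, 1]
X = [[0.0] for _ in y]
baseline_f1 = cross_validate(X, y, fit_majority)
print(round(baseline_f1, 3))
```

Each feature configuration would be scored the same way, and its mean F1 compared against `baseline_f1`.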
Table 4: Classification results for predicting the acceptance of a pull request
We improve the F1 score over baseline by 66% using our features of measuring openness, framing styles and topic diversity/transactive chains. This indicates that the features we have developed encapsulate meaningful quality measures of developer discussions and can be used to gauge the potential productivity of a group of open source developers.
Data Driven Case Study
As an example we examined the outlier thread (circled in red in Graph 1). We were able to neatly summarize the negotiation that happened in the conversation using our metrics.
Argamon, S., Whitelaw, C., Chase, P., Hota, S. R., Garg, N., & Levitan, S. (2007). Stylistic text classification using functional lexical features. Journal of the American Society for Information Science and Technology, 58(6), 802-822.
Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3, 993-1022.
Chatterjee, M. (2007). Textual engagements of a different kind? In Proceedings from Australian Systemic Functional Linguistics Association Congress.
Chong, D., & Druckman, J. N. (2007). Framing theory. Annual Review of Political Science, 10, 103-126.
Cosentino, V., Cánovas Izquierdo, J. L., & Cabot, J. (2014). Three metrics to explore the openness of GitHub projects. arXiv preprint arXiv:1409.4253.
Howley, I., Mayfield, E., & Rosé, C. P. (2012). Linguistic analysis methods for studying small groups. The International Handbook of Collaborative Learning.
Gee, J. (2011). An Introduction to Discourse Analysis: Theory and Method, Routledge.
Jo, Y., & Oh, A. H. (2011, February). Aspect and sentiment unification model for online review analysis. In Proceedings of the fourth ACM international conference on Web search and data mining (pp. 815-824). ACM.
Martin, J. & White, P. (2005). The Language of Evaluation: Appraisal in English, Palgrave
Nguyen, V. A., Boyd-Graber, J., Resnik, P., Cai, D. A., Midberry, J. E., & Wang, Y. (2014). Modeling topic control to detect influence in conversations using nonparametric topic models. Machine Learning, 95(3), 381-421.
Read, J., & Carroll, J. (2012). Annotating expressions of appraisal in English. Language Resources and Evaluation, 46(3), 421-447.
Richards, K. (2006). Language and Professional Identity: Aspects of Collaborative Interaction, Palgrave, Chapter 3
Taboada, M., & Grieve, J. (2004, March). Analyzing appraisal automatically. In Proceedings of AAAI Spring Symposium on Exploring Attitude and Affect in Text (pp. 158-161). AAAI Press.
Tsay, J., Dabbish, L., & Herbsleb, J. (2014). Let’s talk about it: evaluating contributions through discussion in GitHub. In Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering (pp. 144-154). ACM.
This research work was supported by an NSF “BIGDATA” grant. Code, Data and Features can be provided on request.