`scikit-learn`#

# For interactive plots
from bokeh.plotting import figure, show, output_notebook
from bokeh.models import TeX
output_notebook()

A snapshot of the development on the scikit-learn project.

Issues#

import json

import numpy as np
from myst_nb import glue  # assumed: glue as provided by MyST-NB

query_date = np.datetime64("2020-01-01 00:00:00")

# Load the issue records exported for this project
with open("devstats-data/scikit-learn_issues.json", "r") as fh:
    issues = [item["node"] for item in json.load(fh)]

glue("scikit-learn_query_date", str(query_date.astype("M8[D]")))

New issues#

4579 new issues have been opened since 2020-01-01, of which 3581 (78%) have been closed.

The median lifetime of new issues that were created and closed in this period is 118 hours.
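
As a rough sketch of how figures like these can be derived, the snippet below filters issue records by creation date and takes the median lifetime of those already closed. The `createdAt` field appears in the code on this page; the `closedAt` field, the helper name `summarize_new_issues`, and the sample records are assumptions made for illustration.

```python
import numpy as np


def summarize_new_issues(issues, query_date):
    """Return (number opened, number closed, median lifetime in hours)."""

    def to_dt(stamp):
        # Drop the trailing "Z" so NumPy does not warn about timezones
        return np.datetime64(stamp.rstrip("Z"))

    # Issues opened after the query date
    new = [iss for iss in issues if to_dt(iss["createdAt"]) > query_date]
    # Of those, the ones that have a close timestamp (assumed field name)
    closed = [iss for iss in new if iss.get("closedAt")]
    lifetimes_h = [
        (to_dt(iss["closedAt"]) - to_dt(iss["createdAt"])) / np.timedelta64(1, "h")
        for iss in closed
    ]
    median_hours = float(np.median(lifetimes_h)) if lifetimes_h else float("nan")
    return len(new), len(closed), median_hours


# Made-up records, for illustration only
sample = [
    {"createdAt": "2021-03-01T00:00:00Z", "closedAt": "2021-03-01T10:00:00Z"},
    {"createdAt": "2021-03-02T00:00:00Z", "closedAt": "2021-03-03T00:00:00Z"},
    {"createdAt": "2021-03-04T00:00:00Z", "closedAt": None},
    {"createdAt": "2019-06-01T00:00:00Z", "closedAt": "2019-06-02T00:00:00Z"},
]
n_new, n_closed, median_hours = summarize_new_issues(
    sample, np.datetime64("2020-01-01")
)
```

Dividing a `timedelta64` by `np.timedelta64(1, "h")` yields plain floats in hours, which avoids an explicit unit conversion step.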

Time to response#

# Remove issues that are less than a day old for the following analysis.
# "createdAt" timestamps are UTC, so compare against the current UTC time.
now_utc = np.datetime64("now")
newly_created_day_old = [
    iss for iss in newly_created
    if (now_utc - np.datetime64(iss["createdAt"].rstrip("Z"))) > np.timedelta64(1, "D")
]

# TODO: really need pandas here
commented_issues = [
    iss for iss in newly_created_day_old
    if any(
        e["node"]["__typename"] == "IssueComment" for e in iss["timelineItems"]["edges"]
    )
]
first_commenters, time_to_first_comment = [], []
for iss in commented_issues:
    for e in iss["timelineItems"]["edges"]:
        if e["node"]["__typename"] == "IssueComment":
            try:
                user = e["node"]["author"]["login"]
            except TypeError:
                # This can happen e.g. when a user deletes their GH acct
                user = "UNKNOWN"
            first_commenters.append(user)
            dt = (np.datetime64(e["node"]["createdAt"].rstrip("Z"))
                  - np.datetime64(iss["createdAt"].rstrip("Z")))
            time_to_first_comment.append(dt.astype("m8[m]"))
            break  # Only want the first commenter
time_to_first_comment = np.array(time_to_first_comment)  # in minutes

median_time_til_first_response = np.median(time_to_first_comment.astype(int) / 60)

cutoffs = [
    np.timedelta64(1, "h"),
    np.timedelta64(12, "h"),
    np.timedelta64(24, "h"),
    np.timedelta64(3, "D"),
    np.timedelta64(7, "D"),
    np.timedelta64(14, "D"),
]
num_issues_commented_by_cutoff = np.array(
    [
        np.sum(time_to_first_comment < cutoff) for cutoff in cutoffs
    ]
)

# TODO: Update IssueComment query to include:
#  - whether the commenter is a maintainer
#  - datetime of comment
# This will allow analysis of what fraction of issues are addressed by
# maintainers vs. non-maintainer, and the distribution of how long an issue
# usually sits before it's at least commented on

glue(
    "scikit-learn_num_new_issues_responded",
    percent_val(len(commented_issues), len(newly_created_day_old))
)

glue("scikit-learn_new_issues_at_least_1_day_old", len(newly_created_day_old))
glue("scikit-learn_median_response_time", f"{median_time_til_first_response:1.0f}")

Of the 4579 issues that are at least 24 hours old, 4269 (93%) have received at least one comment. The median time until an issue receives its first response is 8 hours.
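
The code above calls `percent_val`, which is not defined on this page. A minimal sketch of such a helper, assuming it formats a count alongside its share of a total, might look like the following; the actual helper may differ.

```python
def percent_val(val, denom):
    """Format a count together with its percentage of `denom`."""
    if denom == 0:
        # Avoid dividing by zero when there is nothing to compare against
        return f"{val} (n/a)"
    return f"{val} ({100 * val / denom:1.0f}%)"


print(percent_val(4269, 4579))  # prints "4269 (93%)"
```

Returning a ready-made string keeps the `glue` call simple, at the cost of not exposing the raw percentage for further computation.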

First responders#

Contributor	# of times commented first
glemaitre	890
thomasjpfan	370
ogrisel	308
adrinjalali	261
NicolasHug	206
jnothman	190
lesteve	180
jeremiedbb	174
rth	116
betatim	104
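
A tally like the one above can be sketched with `collections.Counter`, assuming a list of logins shaped like `first_commenters` from the earlier code cell. The sample logins below are made up for illustration.

```python
from collections import Counter

# Hypothetical stand-in for the `first_commenters` list built earlier
first_commenters = ["alice", "bob", "alice", "UNKNOWN", "carol", "alice", "bob"]

# Count how often each login was the first to comment, most frequent first
top_responders = Counter(first_commenters).most_common(10)

for login, count in top_responders:
    print(f"{login:<12}{count:>4}")
```

Deleted accounts are grouped under the `"UNKNOWN"` placeholder here, so they show up as a single entry rather than being dropped.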

Pull Requests#

Merged PRs over time#

A look at merged PRs over time.

# All contributors
merged_prs = [pr for pr in prs if pr['state'] == 'MERGED']
merge_dates = np.array([pr['mergedAt'] for pr in merged_prs], dtype=np.datetime64)
binsize = np.timedelta64(30, "D")
date_bins = np.arange(merge_dates[0], merge_dates[-1], binsize)
h_all, bedges = np.histogram(merge_dates, date_bins)
bcenters = bedges[:-1] + binsize / 2
smoothing_interval = 4  # in units of bin-width

# First-time contributors
first_time_contributor = []
prev_contrib = set()
for record in merged_prs:
    try:
        author = record['author']['login']
    except TypeError:  # Author no longer has GitHub account
        first_time_contributor.append(None)
        continue
    if author not in prev_contrib:
        first_time_contributor.append(True)
        prev_contrib.add(author)
    else:
        first_time_contributor.append(False)
# Object dtype for handling None
first_time_contributor = np.array(first_time_contributor, dtype=object)
# Focus on first time contributors
ftc_mask = first_time_contributor == True
ftc_dates = merge_dates[ftc_mask]

h_ftc, bedges = np.histogram(ftc_dates, date_bins)

fig, axes = plt.subplots(1, 2, figsize=(16, 8))
for ax, h, whom in zip(
    axes.ravel(), (h_all, h_ftc), ("all contributors", "first-time contributors")
):
    ax.bar(bcenters, h, width=binsize, label="Raw")
    ax.plot(
        bcenters,
        np.convolve(h, np.ones(smoothing_interval), 'same') / smoothing_interval,
        label=f"{binsize * smoothing_interval} moving average",
        color='tab:orange',
        linewidth=2.0,
    )

    ax.set_title(f'{whom}')
    ax.legend()

fig.suptitle("Merged PRs from:")
axes[0].set_xlabel('Time')
axes[0].set_ylabel(f'# Merged PRs / {binsize} interval')
axes[1].set_ylim(axes[0].get_ylim())
fig.autofmt_xdate()

# TODO: Replace this with `glue` once the glue:figure directive supports
# alt-text
import os
os.makedirs("thumbs", exist_ok=True)
plt.savefig("thumbs/scikit-learn.png", bbox_inches="tight")

[Figure: Merged PRs per 30-day interval with a moving average, shown for all contributors (left) and first-time contributors (right).]

PR lifetime#

The following plot shows the “survival” of PRs over time: for each age in days, it shows how many PRs stayed open for at least that long. Merged PRs and still-open PRs are plotted separately; closed but unmerged PRs are currently not included.

merged_prs = [pr for pr in prs if pr['state'] == 'MERGED']
lifetimes_merged = np.array(
    [isoparse(pr["mergedAt"]) - isoparse(pr["createdAt"]) for pr in merged_prs],
    dtype="m8[m]").view("int64") / (60 * 24)  # days
lifetimes_merged.sort()

#closed_prs = [pr for pr in prs if pr['state'] == 'CLOSED']
#lifetimes_closed = np.array(
#    [isoparse(pr["mergedAt"]) - isoparse(pr["createdAt"]) for pr in closed_prs],
#    dtype="m8[m]").view("int64") / (60 * 24)  # days
#lifetimes_closed.sort()


# Use the newest issue to guess a time when the data was generated.
# Can this logic be improved?
current_time = isoparse(max(iss["createdAt"] for iss in issues))

open_prs = [pr for pr in prs if pr['state'] == 'OPEN']
age_open = np.array(
    [current_time - isoparse(pr["createdAt"]) for pr in open_prs],
    dtype="m8[m]").view("int64") / (60 * 24)  # days
age_open.sort()

fig, ax = plt.subplots(figsize=(6, 4))
number_merged = np.arange(1, len(lifetimes_merged)+1)[::-1]
ax.step(lifetimes_merged, number_merged, label="Merged")

#ax.step(lifetimes_closed, np.arange(1, len(lifetimes_closed)+1)[::-1])

number_open = np.arange(1, len(age_open)+1)[::-1]
ax.step(age_open, number_open, label="Open")

# Find the first point where fewer merged PRs than open PRs have survived this long:
all_lifetimes = np.concatenate([lifetimes_merged, age_open])
all_lifetimes.sort()

number_merged_all_t = np.interp(all_lifetimes, lifetimes_merged, number_merged)
number_open_all_t = np.interp(all_lifetimes, age_open, number_open)

first_idx = np.argmax(number_merged_all_t < number_open_all_t)
first_time = all_lifetimes[first_idx]

ax.vlines(
        [first_time], 0, 1, transform=ax.get_xaxis_transform(), colors='k',
        zorder=0, linestyle="--")

ax.annotate(
    f"{round(first_time)} days",
    xy=(first_time, number_open_all_t[first_idx]),
    xytext=(5, 5), textcoords="offset points",
    va="bottom", ha="left")

ax.legend()
ax.set_xlabel("Time until merged or time open [days]")
ax.set_ylabel(r"# of PRs open this long or longer")
ax.set_xscale("log")
fig.autofmt_xdate()
fig.tight_layout();

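[Figure: Survival curves for merged and open PRs, showing the number of PRs open this long or longer against time in days (log scale), with the crossover point marked by a dashed line.]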

Mergeability of Open PRs#

Warning: The data contains PRs with unknown merge status. Please re-download the data to get accurate info about PR mergeability.

[Figure: Mergeability of open PRs.]

Number of PR participants#

Where contributions come from#

There have been a total of 12943 merged PRs[1] submitted by 2935 unique authors. 1941 (66%) of these authors have submitted “fly-by” PRs, i.e. they have contributed to the project exactly once to date.
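
One way such counts could be computed is sketched below, assuming merged PR records carry an `author` mapping with a `login` key, as in the code earlier on this page. The function name `flyby_stats` and the sample records are illustrative only.

```python
from collections import Counter


def flyby_stats(merged_prs):
    """Return (unique authors, authors with exactly one merged PR)."""
    logins = [
        pr["author"]["login"]
        for pr in merged_prs
        if pr.get("author")  # author is None when the GitHub account was deleted
    ]
    prs_per_author = Counter(logins)
    flybys = sum(1 for n in prs_per_author.values() if n == 1)
    return len(prs_per_author), flybys


# Made-up records, for illustration only
sample_prs = [
    {"author": {"login": "alice"}},
    {"author": {"login": "alice"}},
    {"author": {"login": "bob"}},
    {"author": None},
    {"author": {"login": "carol"}},
]
n_authors, n_flyby = flyby_stats(sample_prs)
```

PRs whose author account no longer exists are skipped in this sketch, since they cannot be attributed to any one contributor.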

Pony factor#

Another way to look at these data is in terms of the pony factor, described as:

The minimum number of contributors whose total contribution constitutes a majority of the contributions.

For this analysis, we will consider merged PRs as the metric for contribution. Considering all merged PRs over the lifetime of the project, the pony factor is: 34.
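
The definition above can be sketched in a few lines, assuming one author login per merged PR and reading “majority” as strictly more than half of all contributions. The function name `pony_factor` and the sample data are illustrative and not taken from the analysis code.

```python
from collections import Counter


def pony_factor(authors):
    """Smallest number of contributors covering more than half of all contributions."""
    # Contribution counts per author, largest first
    counts = sorted(Counter(authors).values(), reverse=True)
    total = sum(counts)
    running = 0
    for n_contributors, count in enumerate(counts, start=1):
        running += count
        if running > total / 2:
            return n_contributors
    return 0  # no contributions at all


# Made-up data: 11 merged PRs from four authors
authors = ["a"] * 5 + ["b"] * 3 + ["c"] * 2 + ["d"]
print(pony_factor(authors))  # prints 2
```

Sorting authors by contribution count first ensures the running total reaches a majority with as few contributors as possible.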

[Figure: Pony factor.]
