{"id":427,"date":"2017-09-01T13:14:37","date_gmt":"2017-09-01T05:14:37","guid":{"rendered":"http:\/\/vinta.ws\/code\/?p=427"},"modified":"2026-03-17T01:17:09","modified_gmt":"2026-03-16T17:17:09","slug":"play-with-github-archive-dataset-on-bigquery","status":"publish","type":"post","link":"https:\/\/vinta.ws\/code\/play-with-github-archive-dataset-on-bigquery.html","title":{"rendered":"Play with GitHub Archive Dataset on BigQuery"},"content":{"rendered":"<p>Google BigQuery is a web service that lets you do interactive analysis of massive datasets - analyzing billions of rows in seconds.<\/p>\n<p>ref:<br \/>\n<a href=\"https:\/\/www.githubarchive.org\/#bigquery\">https:\/\/www.githubarchive.org\/#bigquery<\/a><br \/>\n<a href=\"https:\/\/bigquery.cloud.google.com\/table\/githubarchive:month.201612\">https:\/\/bigquery.cloud.google.com\/table\/githubarchive:month.201612<\/a><\/p>\n<p>See also:<br \/>\n<a href=\"http:\/\/ghtorrent.org\/\">http:\/\/ghtorrent.org\/<\/a><\/p>\n<h2>Show repository informations (1)<\/h2>\n<pre class=\"line-numbers\"><code class=\"language-sql\">WITH repo_info AS (\n  SELECT repo.id AS id, repo.name AS name, JSON_EXTRACT_SCALAR(payload, '$.pull_request.base.repo.description') AS description\n  FROM <code>githubarchive.month.2017*<\/code>\n  -- FROM <code>githubarchive.year.2016<\/code>\n  -- FROM <code>githubarchive.year.*<\/code>\n  WHERE type = \"PullRequestEvent\"\n)\n\nSELECT repo_info.name, ANY_VALUE(repo_info.description) AS description\nFROM repo_info\nWHERE\n  repo_info.description IS NOT NULL AND\n  repo_info.description != \"\"\nGROUP BY repo_info.name\nORDER BY repo_info.name<\/code><\/pre>\n<p>ref:<br \/>\n<a href=\"https:\/\/cloud.google.com\/bigquery\/docs\/reference\/standard-sql\/functions-and-operators#json-functions\">https:\/\/cloud.google.com\/bigquery\/docs\/reference\/standard-sql\/functions-and-operators#json-functions<\/a><br \/>\n<a href=\"https:\/\/cloud.google.com\/bigquery\/docs\/reference\/standard-sql\/functions-and-operators#any_value\">https:\/\/cloud.google.com\/bigquery\/docs\/reference\/standard-sql\/functions-and-operators#any_value<\/a><\/p>\n<h2>Show repository informations (2)<\/h2>\n<pre class=\"line-numbers\"><code class=\"language-sql\">WITH repo_info AS (\n  SELECT repo.id AS id, repo.name AS name, JSON_EXTRACT_SCALAR(payload, '$.description') AS description\n  FROM <code>githubarchive.month.201708<\/code>\n  WHERE type = \"CreateEvent\"\n)\n\nSELECT repo_info.name, ANY_VALUE(repo_info.description) AS description\nFROM repo_info\nWHERE\n  repo_info.description IS NOT NULL AND\n  repo_info.description != \"\"\nGROUP BY repo_info.name\nORDER BY repo_info.name<\/code><\/pre>\n<h2>Show repository informations (3)<\/h2>\n<pre class=\"line-numbers\"><code class=\"language-sql\">SELECT name, description\nFROM <code>ghtorrent-bq.ght_2017_04_01.projects<\/code>\nWHERE\n  forked_from IS NULL AND\n  description IS NOT NULL AND\n  description != \"\"<\/code><\/pre>\n<h2>Show starred repositories by a specific user<\/h2>\n<p>You must use <code>WatchEvent<\/code> for starring a repo:<br \/>\n<a href=\"https:\/\/developer.github.com\/v3\/activity\/events\/types\/#watchevent\">https:\/\/developer.github.com\/v3\/activity\/events\/types\/#watchevent<\/a><\/p>\n<pre class=\"line-numbers\"><code class=\"language-sql\">SELECT repo.name, created_at\nFROM TABLE_QUERY([githubarchive:month], 'LEFT(table_ID,4) IN (\"2017\",\"2016\",\"2015\")') \nWHERE type = \"WatchEvent\" AND actor.login = 'vinta'\nGROUP BY repo.name, created_at\nORDER BY created_at DESC<\/code><\/pre>\n<h2>Show starred repositories per user who has 10+ starred repositories<\/h2>\n<pre class=\"line-numbers\"><code class=\"language-sql\">WITH stars AS (\n     SELECT DISTINCT actor.login AS user, repo.name AS repo\n     FROM <code>githubarchive.month.2017*<\/code>\n     WHERE type=\"WatchEvent\"\n),\nrepositories_stars AS (\n     SELECT repo, COUNT(*) as c FROM stars GROUP BY repo\n     ORDER BY c DESC\n     LIMIT 1000\n),\nusers_stars AS (\n    SELECT user, COUNT(*) as c FROM  stars\n    WHERE repo IN (SELECT repo FROM repositories_stars)\n    GROUP BY user\n    HAVING c &gt;= 10\n    LIMIT 10000\n)\nSELECT user, repo FROM stars\nWHERE repo IN (SELECT repo FROM repositories_stars)\nAND user IN (SELECT user FROM users_stars)<\/code><\/pre>\n<p>ref:<br \/>\n<a href=\"https:\/\/gist.github.com\/jbochi\/2e8ddcc5939e70e5368326aa034a144e\">https:\/\/gist.github.com\/jbochi\/2e8ddcc5939e70e5368326aa034a144e<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Google BigQuery is a web service that lets you do interactive analysis of very massive datasets - analyzing billions of rows in seconds.<\/p>\n","protected":false},"author":1,"featured_media":428,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[112],"tags":[10,115,114],"class_list":["post-427","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-about-big-data","tag-github","tag-google-bigquery","tag-google-cloud-platform"],"_links":{"self":[{"href":"https:\/\/vinta.ws\/code\/wp-json\/wp\/v2\/posts\/427","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/vinta.ws\/code\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/vinta.ws\/code\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/vinta.ws\/code\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/vinta.ws\/code\/wp-json\/wp\/v2\/comments?post=427"}],"version-history":[{"count":0,"href":"https:\/\/vinta.ws\/code\/wp-json\/wp\/v2\/posts\/427\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/vinta.ws\/code\/wp-json\/wp\/v2\/media\/428"}],"wp:attachment":[{"href":"https:\/\/vinta.ws\/code\/wp-json\/wp\/v2\/media?parent=427"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/vinta.ws\/code\/wp-json\/wp\/v2\/categories?post=427"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/vinta.ws\/code\/wp-json\/wp\/v2\/tags?post=427"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}