SQLにおける3つの重複排除手法の概要

2022-01-05 08:03:54

SQLを使って数値を持ち上げているとき、uv（ユニークビジター）を求めるときなど、テーブルの中で重複した値に出会うことが多いので、重複排除をする必要があります。
MySQLでは通常distinct句またはgroup by句を使用しますが、ウィンドウ関数をサポートするSQL（Hive SQL、Oracleなど）では、row_numberウィンドウ関数を使用して重複排除を行うこともできます。

例として、テーブルタスクがあります。

<テーブルタスクID オーダーID 開始時間 1 123 2020-01-05 1 213 2020-01-06 1 321 2020-01-07 2 456 2020-01-06 2 465 2020-01-07 3 798 2020-01-06

注意事項

task_id: タスクID。
order_id: 注文ID。
開始時間：開始時間

注：1つのタスクは複数のオーダーに対応する

タスクの総数を求める必要がありますが、task_idは一意ではないので、重複を排除する必要があります。

明確

-- list all unique values of task_id (de-duplicated records)
-- select distinct task_id
-- from Task;

-- total number of tasks
select count(distinct task_id) task_num
from Task;

distinctは一般に効率が悪い。重複排除後の特定の値を表示するのには適しておらず、通常、エントリ数を計算するためにcountと組み合わせて使用されます。

distinctは、selectの後に続くすべてのフィールドの値を統一するために使用します。例えば、distinctの後に2つのフィールドがある場合、1,1行目と1,2行目は重複しないことになります。

によるグループ

-- list all unique values of task_id (de-duplicated records, null is also a value)
-- select task_id
-- from Task
-- group by task_id;

-- total number of tasks
select count(task_id) task_num
from (select task_id
      from Task
      group by task_id) tmp;

行番号

row_number はウィンドウ関数で、次のような構文になります。
row_number() over (partition by <field name for grouping> order by <field name for sorting within group>)
ここで、partition byの部分は省略可能です。

-- use in sql with window function support
select count(case when rn=1 then task_id else null end) task_num
from (select task_id
       , row_number() over (partition by task_id order by start_time) rn
   from Task) tmp;

さらに、テーブルテストを使って、重複排除でdistinctとgroup byがどのように使われるかを見てみましょう。

<テーブルユーザーID ユーザータイプ 1 1 1 2 2 1

-- semicolon below; to separate rows
select distinct user_id
from Test; -- returns 1; 2

select distinct user_id, user_type
from Test; -- returns 1, 1; 1, 2; 2, 1

select user_id
from Test
group by user_id; -- return 1; 2

select user_id, user_type
from Test
group by user_id, user_type; -- returns 1, 1; 1, 2; 2, 1

select user_id, user_type
from Test
group by user_id;    
-- Hive, Oracle, etc. will report an error, mysql can write it like this.
-- Returns 1, 1 or 1, 2 ; 2, 1 (two rows in total). Only the fields after group by will be de-duplicated, that is, the number of records returned at the end is equal to the number of records in the previous sql, i.e. 2
-- fields that are not placed after group by but are placed in select, only one row will be returned (it seems to be the first one, so there is no pattern)

今回の記事は、SQLの重複排除の3つの方法についてです。SQLの重複排除の詳細については、スクリプトハウスの過去記事を検索するか、以下の関連記事を引き続きご覧ください。

SQLにおける3つの重複排除手法の概要

明確

によるグループ

行番号

関連

SQLインジェクションとその防止、マイベイトの基本的な役割について

Windows環境でのSqlファイルの一括実行

SQL Server 2017がサーバーに接続できない問題解決

SQL Server2017では、IPをサーバー名としてサーバーに接続します。

DataGrip Formatting SQLの実装（カスタムSqlフォーマット）

SqlServerデータベースリモート接続ケースチュートリアル

SQL Server のジョブが失敗しました。所有者がサーバーアクセス権を持っているかどうか判断できない

SQLServerにおけるJSONドキュメント型データのクエリ問題を解決する。

SQLサーバーのデータベースで、SAユーザーがロックされている問題を解決する

SQLでのmod()関数の余りの使用法

最新

nginxです。[emerg] 0.0.0.0:80 への bind() に失敗しました (98: アドレスは既に使用中です)

htmlページでギリシャ文字を使うには

ピュアhtml+cssでの要素読み込み効果

純粋なhtml + cssで五輪を実現するサンプルコード

ナビゲーションバー・ドロップダウンメニューのHTML+CSSサンプルコード

タイピング効果を実現するピュアhtml+css

htmlの選択ボックスのプレースホルダー作成に関する質問

html css3 伸縮しない画像表示効果

トップナビゲーションバーメニュー作成用HTML+CSS

html+css 実装サイバーパンク風ボタン

おすすめ

SQL Server テーブルパーティション削除の詳細

SQL SERVERのストアドプロシージャを使用した履歴データの移行について

SQLServerクリーンアップログファイルのメソッド事例詳細

SQL Server一括挿入データ事例詳細

SQL ServerのSELECT INTOとINSERT INTOのSELECTのケースを説明する

mybatis動的SQL実装ロジックコード詳細

SQL SERVERのコミット・トランザクションのロールバック機構

SQL クエリ結果カラムのカンマ区切り文字列へのステッチング法

SQL SERVERオープンCDC実践講座詳細

sql serverで最初の1000行のデータを削除する方法の例