重点内容几种常用的合并数据集的方法: 1) pd.concat ([data1,data2,…], axis=1(or 0), keys=[‘key1’, ‘key2’,…], names=[‘upper’,’lower’,…], ignore_index=True/False,…)
2) pd.merge(left, right, how=’inner’, on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False, suffixes=(‘_x’, ‘_y’), copy=True, indicator=False, validate=None)
3) data1.append(data2): 不推荐用,用pd.concat替代
举例: 数据集data1和data2:
data1: subject_id first_name last_name 0 4 Billy Bonder 1 5 Brian Black 2 6 Bran Balwner 3 7 Bryce Brice 4 8 Betty Btisan data1: subject_id first_name last_name 0 1 Alex Anderson 1 2 Amy Ackerman 2 3 Allen Ali 3 4 Alice Aoni 4 5 Ayoung Atiches方法和参数选择: - pd.merge(): 1) pd.merge(data1,data2, how=’inner’,on=’subject_id’)
pd.merge(data1,data2, how='inner',on='subject_id') Out[31]: subject_id first_name_x last_name_x first_name_y last_name_y 0 4 Alice Aoni Billy Bonder 1 5 Ayoung Atiches Brian Black2) pd.merge(data1,data2, how=’right’,on=’subject_id’)
pd.merge(data1,data2, how='right',on='subject_id') Out[34]: subject_id first_name_x last_name_x first_name_y last_name_y 0 4 Alice Aoni Billy Bonder 1 5 Ayoung Atiches Brian Black 2 6 NaN NaN Bran Balwner 3 7 NaN NaN Bryce Brice 4 8 NaN NaN Betty Btisan- pd.concat(): 1) pd.concat([data1,data2],axis=1,ignore_index=True)
all_data_col = pd.concat([data1,data2],axis=1,ignore_index=True) all_data_col Out[40]: 0 1 2 3 4 5 0 1 Alex Anderson 4 Billy Bonder 1 2 Amy Ackerman 5 Brian Black 2 3 Allen Ali 6 Bran Balwner 3 4 Alice Aoni 7 Bryce Brice 4 5 Ayoung Atiches 8 Betty Btisan2) pd.concat([data1,data2],axis=0)
pd.concat([data1,data2],axis=0) all_data_col Out[38]: subject_id first_name last_name 0 1 Alex Anderson 1 2 Amy Ackerman 2 3 Allen Ali 3 4 Alice Aoni 4 5 Ayoung Atiches 0 4 Billy Bonder 1 5 Brian Black 2 6 Bran Balwner 3 7 Bryce Brice 4 8 Betty Btisan-------- data1.append(data2): all_data = data1.append(data2) all_data Out[14]: subject_id first_name last_name 0 1 Alex Anderson 1 2 Amy Ackerman 2 3 Allen Ali 3 4 Alice Aoni 4 5 Ayoung Atiches 0 4 Billy Bonder 1 5 Brian Black 2 6 Bran Balwner 3 7 Bryce Brice 4 8 Betty Btisan Less efficient: >>> df = pd.DataFrame(columns=['A']) >>> for i in range(5): ... df = df.append({'A'}: i}, ignore_index=True) >>> df A 0 0 1 1 2 2 3 3 4 4 More efficient: >>> pd.concat([pd.DataFrame([i], columns=['A']) for i in range(5)],ignore_index=True) A 0 0 1 1 2 2 3 3 4 4help()代码信息:
concat(objs, axis=0, join='outer', join_axes=None, ignore_index=False, keys=None, levels=None, names=None, verify_integrity=False, copy=True) Concatenate pandas objects along a particular axis with optional set logic along the other axes. Parameters ---------- objs : a sequence or mapping of Series, DataFrame, or Panel objects If a dict is passed, the sorted keys will be used as the `keys` argument, unless it is passed, in which case the values will be selected (see below). Any None objects will be dropped silently unless they are all None in which case a ValueError will be raised axis : {0/'index', 1/'columns'}, default 0 The axis to concatenate along join : {'inner', 'outer'}, default 'outer' How to handle indexes on other axis(es) join_axes : list of Index objects Specific indexes to use for the other n - 1 axes instead of performing inner/outer set logic ignore_index : boolean, default False If True, do not use the index values along the concatenation axis. The resulting axis will be labeled 0, ..., n - 1. This is useful if you are concatenating objects where the concatenation axis does not have meaningful indexing information. Note the index values on the other axes are still respected in the join. keys : sequence, default None If multiple levels passed, should contain tuples. Construct hierarchical index using the passed keys as the outermost level levels : list of sequences, default None Specific levels (unique values) to use for constructing a MultiIndex. Otherwise they will be inferred from the keys names : list, default None Names for the levels in the resulting hierarchical index verify_integrity : boolean, default False Check whether the new concatenated axis contains duplicates. This can be very expensive relative to the actual data concatenation copy : boolean, default True If False, do not copy data unnecessarily Returns ------- concatenated : object, type of objs When concatenating all ``Series`` along the index (axis=0), a ``Series`` is returned. When ``objs`` contains at least one ``DataFrame``, a ``DataFrame`` is returned. When concatenating along the columns (axis=1), a ``DataFrame`` is returned. merge(left, right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False, suffixes=('_x', '_y'), copy=True, indicator=False, validate=None) Merge DataFrame objects by performing a database-style join operation by columns or indexes. If joining columns on columns, the DataFrame indexes *will be ignored*. Otherwise if joining indexes on indexes or indexes on a column or columns, the index will be passed on. Parameters ---------- left : DataFrame right : DataFrame how : {'left', 'right', 'outer', 'inner'}, default 'inner' * left: use only keys from left frame, similar to a SQL left outer join; preserve key order * right: use only keys from right frame, similar to a SQL right outer join; preserve key order * outer: use union of keys from both frames, similar to a SQL full outer join; sort keys lexicographically * inner: use intersection of keys from both frames, similar to a SQL inner join; preserve the order of the left keys on : label or list Field names to join on. Must be found in both DataFrames. If on is None and not merging on indexes, then it merges on the intersection of the columns by default. left_on : label or list, or array-like Field names to join on in left DataFrame. Can be a vector or list of vectors of the length of the DataFrame to use a particular vector as the join key instead of columns right_on : label or list, or array-like Field names to join on in right DataFrame or vector/list of vectors per left_on docs left_index : boolean, default False Use the index from the left DataFrame as the join key(s). If it is a MultiIndex, the number of keys in the other DataFrame (either the index or a number of columns) must match the number of levels right_index : boolean, default False Use the index from the right DataFrame as the join key. Same caveats as left_index sort : boolean, default False Sort the join keys lexicographically in the result DataFrame. If False, the order of the join keys depends on the join type (how keyword) suffixes : 2-length sequence (tuple, list, ...) Suffix to apply to overlapping column names in the left and right side, respectively copy : boolean, default True If False, do not copy data unnecessarily indicator : boolean or string, default False If True, adds a column to output DataFrame called "_merge" with information on the source of each row. If string, column with information on source of each row will be added to output DataFrame, and column will be named value of string. Information column is Categorical-type and takes on a value of "left_only" for observations whose merge key only appears in 'left' DataFrame, "right_only" for observations whose merge key only appears in 'right' DataFrame, and "both" if the observation's merge key is found in both.