Python pandas: DataFrame

When constructing a DataFrame, index selects the rows

The Series object has an index that assigns a value (label) to its elements.

When constructing a DataFrame object, the parameter index can be passed a list of labels with which the elements for the rows are selected:

import pandas as pd

s1 = pd.Series ( [  10 ,  20 ,  30 ,  40 ,  50 ],
        index =  ['abc','def','ghi','jkl','mno'])


s2 = pd.Series ( [  1.1,       3.3 , 4.4 , 5.5 , 6.6 , 7.7 ],
        index =  ['abc',      'ghi','jkl','mno','pqr','stu'])


s3 = pd.Series ( [ 111 , 222 , 333 , 444 , 555 , 6.6 , 7.7 ],
        index =  ['jkl','def','vwx','ghi','abc','stu','pqr'])

df = pd.DataFrame(
      { 'col_1': s1,
        'col_2': s2,
        'col_3': s3
      },
      index = ['abc', 'def', 'jkl', 'stu']
)

print(df)
#
#       col_1  col_2  col_3
#  abc   10.0    1.1  555.0
#  def   20.0    NaN  222.0
#  jkl   40.0    4.4  111.0
#  stu    NaN    7.7    6.6

Column selection

A data frame's column can be selected with df['column-name'] (very similar to a dict). This syntax returns a Series.

Multiple columns can be selected with df[ ['col-three', 'col-nine' …]]. This syntax returns a DataFrame.

import pandas as pd
  
df = pd.DataFrame ({
           'x': [7, 2, 3    ],
           'y': [4, 4, 5    ],
           'z': [9, 6, 5    ]
       },
    )

print(type(df['x']))
print(     df['x'] )

print(type(df[ ['y', 'z']]))
print(     df[ ['y', 'z']] )

Row selection

Rows can be selected by using a condition like so: df[ df['col'] > x ]

Multiple conditions are and'ed or or'ed with & and | (trying to use and or or results in the Value error The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()

import pandas as pd

fruits = pd.DataFrame(
   {
    'price': [0.50, 1.25, 2.00, 0.75, 1.50, 0.80, 2.50, 0.60, 3.00],
    'color': ['red', 'green', 'orange', 'yellow', 'purple', 'brown', 'black', 'pink', 'white'],
    'avg_weight': [100, 150, 200, 50, 75, 300, 120, 90, 175],
    'sweetness': [4, 7, 6, 3, 9, 2, 5, 8, 7],
    'shelf_life': [5, 3, 7, 2, 6, 8, 4, 9, 10],
   },
   index =['apple', 'grapes', 'orange', 'lemon', 'banana', 'pear', 'blackberry', 'watermelon', 'kiwi']
)

print(fruits[ fruits['sweetness'] > 6 ])
#
#              price   color  avg_weight  sweetness  shelf_life
#  grapes       1.25   green         150          7           3
#  banana       1.50  purple          75          9           6
#  watermelon   0.60    pink          90          8           9
#  kiwi         3.00   white         175          7          10

print(fruits[ (fruits['sweetness' ] > 6)    &
              (fruits['shelf_life'] > 7) ])
#
#              price   color  avg_weight  sweetness  shelf_life
#  watermelon    0.6   pink          90          8           9
#  kiwi          3.0  white         175          7          10

print(fruits[ (fruits['sweetness' ] > 8)    |
              (fruits['shelf_life'] > 7) ])
#
#              price   color  avg_weight  sweetness  shelf_life
#  banana        1.5  purple          75          9           6
#  pear          0.8   brown         300          2           8
#  watermelon    0.6    pink          90          8           9
#  kiwi          3.0   white         175          7          10


print(fruits[ fruits['color'].isin( ['yellow', 'green', 'red' ]) ])
#
#          price   color  avg_weight  sweetness  shelf_life
#  apple    0.50     red         100          4           5
#  grapes   1.25   green         150          7           3
#  lemon    0.75  yellow          50          3           2

Select rows using the index:

print(fruits.loc[ ['lemon', 'grapes'] ])
#
#          price   color  avg_weight  sweetness  shelf_life
#  lemon    0.75  yellow          50          3           2
#  grapes   1.25   green         150          7           3

Select rows using their (zero based) «row-number»:

print(fruits.iloc[ [2,3,5,7] ])
#
#              price   color  avg_weight  sweetness  shelf_life
#  orange       2.00  orange         200          6           7
#  lemon        0.75  yellow          50          3           2
#  pear         0.80   brown         300          2           8
#  watermelon   0.60    pink          90          8           9

group by

import pandas as pd
import numpy  as np

np.random.seed(2807)

df = pd.DataFrame({
    'gender': np.random.choice(['M', 'F'], size=10),
    'age'   : np.random.normal(40, 10, size=10).astype(int),
    'income': np.random.normal(75000, 15000, size=10).astype(int)
   },
   index = ['Alice', 'Bob', 'Charlie', 'David', 'Emily', 'Frank', 'Gina', 'Harry', 'Isabelle', 'Jake']
)

print(df.groupby('gender')['income'].mean())
#
#  gender
#  F    80786.333333
#  M    72413.714286
#  Name: income, dtype: float64

read and write methods

The pandas module has a few read_…() functions which construct a data frame from data in various formats.

The data of a data frame can be written (persisted) to another format with one of the to_…() methods.

fillna

import pandas

df = pandas.DataFrame(
      { 'col_1': pandas.Series( [   10, None,   30, None ] ),
        'col_2': pandas.Series( [ None,  'Y', None,  'Q' ] )
      })


df.col_1 = df.col_1.fillna( 999 )

print(df)
#    col_1 col_2
# 0     10  None
# 1    999     Y
# 2     30  None
# 3    999     Q

df.col_2.fillna ( '?', inplace=True )

print(df)
#    col_1 col_2
# 0     10     ?
# 1    999     Y
# 2     30     ?
# 3    999     Q

Github repository about-python, path: /libraries/pandas/DataFrame/fillna.py

modify column

import pandas

df = pandas.DataFrame(
      { 'col_1': pandas.Series( [  10 ,  20 ,  30 ] ),
        'col_2': pandas.Series( ['foo','bar','baz'] ),
        'col_3': pandas.Series( [  'X',  'Y',  'Z'] )
      })


df.col_1 = df.col_1.map(lambda n: n * n)

print(df)
#    col_1 col_2 col_3
# 0    100   foo     X
# 1    400   bar     Y
# 2    900   baz     Z

Github repository about-python, path: /libraries/pandas/DataFrame/modify-column.py

pivot table

import pandas
import numpy

df = pandas.DataFrame( {
        'c1'   : pandas.Series( ['foo','bar','baz',    'foo','bar','baz',    'bar',   'foo','foo'] ),
        'c2'   : pandas.Series( [  'X',  'Y',  'Z',      'X',  'Y',  'Z',      'X',     'Z',  'Z'] ),
        'vals' : pandas.Series( [  10 ,  20 ,  30 ,       1 ,   2 ,   3 ,      17 ,    100 ,  50 ] )
      })

piv = df.pivot_table(
            index  =['c1'  ], 
            columns=['c2'  ],
            values =['vals'],
            aggfunc=numpy.sum)

print(type(piv))
# <class 'pandas.core.frame.DataFrame'>

print(piv)
#      vals
# c2      X   Y    Z
# c1
# bar    17  22  NaN
# baz   NaN NaN   33
# foo    11 NaN  150

Github repository about-python, path: /libraries/pandas/DataFrame/pivot_table.py

Members of a DataFrame

Filter methods

Filter methods return an array (of the same length as the data frame) of booleans (True or False). Such an array is used to select rows.

`isna()`
`isnull()`
`isin()`
`notna()`
`notnull()`

Statistical/summary methods

`max()`
`mean()`
`median()`
`min()`
`sum()`

to_ methods

A data frame's to_…() method converts the data of a data frame to another format.

`to_clipboard()`
`to_csv()`
`to_dict()`
`to_excel()`
`to_feather()`
`to_gbq()`
`to_hdf()`
`to_html()`
`to_json()`
`to_latex()`
`to_markdown()`
`to_numpy()`
`to_orc()`
`to_parquet()`
`to_period()`
`to_pickle()`
`to_records()`
`to_sql()`
`to_stata()`
`to_string()`
`to_xml()`

The corresponding read_…() methods are found in the pandas module.

_repr_ methods

_repr_… methods make rich display possible in IPython.

`_repr_html_()`
`_repr_latex_()`

Other methods

`abs()`
`_accessors`	`set`
`_accum_func()`
`add()`
`_add_numeric_operations()`
`add_prefix()`
`add_suffix()`
`agg()`
`_agg_by_level()`
`_agg_examples_doc`	`str`
`aggregate()`
`_agg_summary_and_see_also_doc`	`str`
`align()`
`_align_frame()`
`_align_series()`
`all()`
`any()`
`_append()`
`append()`
`apply()`
`applymap()`
`_arith_method()`
`asfreq()`
`_as_manager()`
`asof()`
`assign()`
`astype()`
`at`	`pandas.core.indexing._AtIndexer`
`at_time()`
`_attrs`	`dict`
`attrs`	`dict`
`axes`	`list`
`_AXIS_LEN`	`int`
`_AXIS_ORDERS`	`list`
`_AXIS_TO_AXIS_NUMBER`	`dict`
`backfill()`
`between_time()`
`bfill()`
`bool()`
`_box_col_values()`
`boxplot()`
`_can_fast_transpose`	`bool`
`_check_inplace_and_allows_duplicate_labels()`
`_check_inplace_setting()`
`_check_is_chained_assignment_possible()`
`_check_label_or_level_ambiguity()`
`_check_setitem_copy()`
`_clear_item_cache()`
`clip()`
`_clip_with_one_bound()`
`_clip_with_scalar()`
`_cmp_method()`
`columns`	`pandas.core.indexes.base.Index`
`combine()`
`combine_first()`
`_combine_frame()`
`compare()`
`_consolidate()`
`_consolidate_inplace()`
`_construct_axes_dict()`
`_construct_axes_from_arguments()`
`_constructor`	`type`
`_constructor_sliced`	`type`
`_construct_result()`
`_convert()`
`convert_dtypes()`
`copy()`
`corr()`
`corrwith()`
`count()`
`_count_level()`
`cov()`
`cummax()`
`cummin()`
`cumprod()`
`cumsum()`
`_data`	`pandas.core.internals.managers.BlockManager`
`describe()`	Returns a `DataFrame` with statistical data about the columns with numerical values. The row-indexes are `count`, `mean`, `std`, `min`, `25%`, `50%`, `75%` and `max. The column (names) of the returned` DataFrame `correspond to the` DataFrame `being described. Compare with` info()`
`diff()`
`_dir_additions()`
`_dir_deletions()`
`_dispatch_frame_op()`
`div()`
`divide()`
`dot()`
`drop()`
`_drop_axis()`
`drop_duplicates()`
`_drop_labels_or_levels()`
`droplevel()`
`dropna()`
`dtypes`	`pandas.core.series.Series`
`duplicated()`
`empty`	`bool`
`_ensure_valid_index()`
`eq()`
`equals()`
`eval()`
`ewm()`
`expanding()`
`explode()`
`ffill()`
`fillna()`
`filter()`
`_find_valid_index()`
`first()`
`first_valid_index()`
`_flags`	`pandas.core.flags.Flags`
`flags`	`pandas.core.flags.Flags`
`floordiv()`
`_from_arrays()`
`from_dict()`
`from_records()`
`ge()`
`get()`
`_get_agg_axis()`
`_get_axis()`
`_get_axis_name()`
`_get_axis_number()`
`_get_axis_resolvers()`
`_get_block_manager_axis()`
`_get_bool_data()`
`_get_cleaned_column_resolvers()`
`_get_column_array()`
`_get_index_resolvers()`
`_getitem_bool_array()`
`_get_item_cache()`
`_getitem_multilevel()`
`_get_label_or_level_values()`
`_get_numeric_data()`
`_get_value()`
`_gotitem()`
`groupby()`	Returns a `pandas.core.groupby.generic.DataFrameGroupBy` object. After grouping a data frame, statisical methods can be applied to each group. See also here here
`gt()`
`_HANDLED_TYPES`	`tuple`
`head()`	Returns a data frame with the first `n` rows of a data frame (similar in spirit to the shell command `head`). See also `tail()`
`_hidden_attrs`	`frozenset`
`hist()`
`iat`	`pandas.core.indexing._iAtIndexer`
`id`	`pandas.core.series.Series`
`idxmax()`
`idxmin()`
`iloc[]`	Returns the rows by numerical position. Compare with `loc[]`
`index`	`pandas.core.indexes.range.RangeIndex`
`_indexed_same()`
`infer_objects()`
`info()`	Some internal information about the data frame (number of columns, index type, memory usage etc.) Compare with `describe()`
`_info_axis`	`pandas.core.indexes.base.Index`
`_info_axis_name`	`str`
`_info_axis_number`	`int`
`_info_repr()`
`_init_mgr()`
`_inplace_method()`
`insert()`
`_internal_names`	`list`
`_internal_names_set`	`set`
`interpolate()`
`_is_copy`	`NoneType`
`_iset_item()`
`isetitem()`
`_iset_item_mgr()`
`_iset_not_inplace()`
`_is_homogeneous_type`	`bool`
`_is_label_or_level_reference()`
`_is_label_reference()`
`_is_level_reference()`
`_is_mixed_type`	`bool`
`_is_view`	`bool`
`_item_cache`	`dict`
`items()`
`_iter_column_arrays()`
`iteritems()`
`iterrows()`
`itertuples()`
`_ixs()`
`join()`
`_join_compat()`
`keys()`
`kurt()`
`kurtosis()`
`last()`
`last_valid_index()`
`le()`
`loc[]`	Select rows by their index. Compare with `iloc[]`.
`_logical_func()`
`_logical_method()`
`lookup()`
`lt()`
`mad()`
`mask()`
`_maybe_cache_changed()`
`_maybe_update_cacher()`
`melt()`	The opposite is `pivot()`
`memory_usage()`
`merge()`
`_metadata`	`list`
`_mgr`	`pandas.core.internals.managers.BlockManager`
`_min_count_stat_function()`
`mod()`
`mode()`
`mul()`
`multiply()`
`ndim`	`int`. Compare with `shape` and `size`
`ne()`
`_needs_reindex_multi()`
`nlargest()`
`nsmallest()`
`nunique()`
`pad()`
`pct_change()`
`pipe()`
`pivot()`	The opposite is `melt()`
`pivot_table()`
`plot()`	Plots (the `Series` objects in the `DataFrame`?) or the `DataFrame` itslf on the backend specified by the `plotting.backend` option (whose default is matplotlib).
`pop()`
`pow()`
`prod()`
`product()`
`_protect_consolidate()`
`quantile()`
`query()`
`radd()`
`rank()`
`rdiv()`
`_reduce()`
`_reduce_axis1()`
`reindex()`
`_reindex_axes()`
`_reindex_columns()`
`_reindex_index()`
`reindex_like()`
`_reindex_multi()`
`_reindex_with_indexers()`
`_rename()`
`rename()`
`rename_axis()`
`reorder_levels()`
`replace()`
`_replace_columnwise()`
`_repr_data_resource_()`
`_repr_fits_horizontal_()`
`_repr_fits_vertical_()`
`resample()`
`_reset_cache()`
`_reset_cacher()`
`reset_index()`
`rfloordiv()`
`rmod()`
`rmul()`
`rolling()`
`round()`
`rpow()`
`rsub()`
`rtruediv()`
`sample()`
`_sanitize_column()`
`select_dtypes()`
`sem()`
`_series`	`dict`
`_set_axis()`
`set_axis()`
`_set_axis_name()`
`_set_axis_nocheck()`
`set_flags()`
`set_index()`
`_set_is_copy()`
`_set_item()`
`_setitem_array()`
`_setitem_frame()`
`_set_item_frame_value()`
`_set_item_mgr()`
`_setitem_slice()`
`_set_value()`
`shape`	A `tuple` which describes the data frame's dimensionality. Compare with `size` and `ndim`
`shift()`
`size`	`numpy.int64`. Compare with `shape` and `ndim`.
`skew()`
`_slice()`
`slice_shift()`
`sort_index()`
`sort_values()`
`squeeze()`
`stack()`
`_stat_axis`	`pandas.core.indexes.range.RangeIndex`
`_stat_axis_name`	`str`
`_stat_axis_number`	`int`
`_stat_function()`
`_stat_function_ddof()`
`std()`
`style`	Returns a `Styler` object (which has a `_repr_html_()` method, which makes it poosible for the data frame to be rendered in a Jupyter Notebook)
`sub()`
`subtract()`
`swapaxes()`
`swaplevel()`
`T`	`pandas.core.frame.DataFrame`
`tail()`	Returns a data frame with the last `n` rows of a data frame (similar in spirit to the shell command `tail`). See also `head()`
`_take()`
`take()`
`_take_with_is_copy()`
`_to_dict_of_blocks()`
`to_timestamp()`
`to_xarray()`
`transform()`
`transpose()`
`truediv()`
`truncate()`
`txt`	`pandas.core.series.Series`
`_typ`	`str`
`tz_convert()`
`tz_localize()`
`unstack()`
`update()`
`_update_inplace()`
`_validate_dtype()`
`value_counts()`
`_values`	`numpy.ndarray`
`values`	`numpy.ndarray`
`var()`
`_where()`
`where()`
`xs()`