Title: | Column Profile for Tables and Datasets |
---|---|
Description: | Profiles datasets (collecting statistics and informative summaries about that data) on data frames and 'ODBC' tables: maximum, minimum, mean, standard deviation, nulls, distinct values, data patterns, data/format frequencies. |
Authors: | Arnaldo Vitaliano [aut, cre] |
Maintainer: | Arnaldo Vitaliano <[email protected]> |
License: | GPL-3 | file LICENSE |
Version: | 0.1.0 |
Built: | 2025-01-27 03:40:04 UTC |
Source: | https://github.com/avitaliano/datrprofile |
buildQueryColumnFrequency
buildQueryColumnFrequency(conn.info, ...)
buildQueryColumnFrequency(conn.info, ...)
conn.info |
Connection info created with |
... |
Other parameters |
query column, count(*) from table
buildQueryColumnMetadata
buildQueryColumnMetadata(conn.info, ...)
buildQueryColumnMetadata(conn.info, ...)
conn.info |
Connection info created with |
... |
Other params |
query columns' metadata
buildQueryColumnStats
buildQueryColumnStats(conn.info, ...)
buildQueryColumnStats(conn.info, ...)
conn.info |
Connection info created with |
... |
Other parameters |
query count(distinct column) from table
buildQueryColumnStats.sqlite
## S3 method for class 'sqlite' buildQueryColumnStats(conn.info, schema, table, column, query.filter, ...)
## S3 method for class 'sqlite' buildQueryColumnStats(conn.info, schema, table, column, query.filter, ...)
conn.info |
Connection info created with |
schema |
Table Schema |
table |
Table Name |
column |
Column profiled |
query.filter |
Filter applied to the profile |
... |
Other parameters |
query count(distinct column) from table
buildQueryCountNull
buildQueryCountNull(conn.info, ...)
buildQueryCountNull(conn.info, ...)
conn.info |
Connection info created with |
... |
Other parameters |
query select count(*) where collumn is null
Count total rows from table.
buildQueryCountTotal(conn.info, ...)
buildQueryCountTotal(conn.info, ...)
conn.info |
Connection info created with |
... |
Other params |
query count(*) from table
buildQueryProfileColumnFormatFrequency
buildQueryProfileColumnFormatFrequency(conn.info, ...)
buildQueryProfileColumnFormatFrequency(conn.info, ...)
conn.info |
Connection info created with |
... |
Other parameters |
queries column format frequency from table
Disconnects from database using odbc::dbDisconnect
closeConnection(conn)
closeConnection(conn)
conn |
Connection created at |
TRUE
if succeeded at closing connection
Connects to database using dbConnect
connectDB(conn.info, ...)
connectDB(conn.info, ...)
conn.info |
Connection info created at |
... |
Other parameters |
connection
to database
Connects to database using dbConnect
## Default S3 method: connectDB(conn.info, ...)
## Default S3 method: connectDB(conn.info, ...)
conn.info |
Connection info created at |
... |
Other parameters |
connection
to database
Connects to database using dbConnect
## S3 method for class 'sqlite' connectDB(conn.info, ...)
## S3 method for class 'sqlite' connectDB(conn.info, ...)
conn.info |
Connection info created at |
... |
Other parameters |
connection
to database
Issues query against the RDBS to retrieve information about each column of the table. Name, type, length, precision, etc.
getTableColumns(conn.info, schema, table)
getTableColumns(conn.info, schema, table)
conn.info |
Connection info created with |
schema |
Table schema |
table |
Table name |
data frame containing the columns' metadata
prepareConnection
list connection details needed to connecto
to a RDBS using ODBC
prepareConnection(db.vendor, odbc.driver = odbc::odbc(), db.host = NULL, db.name = NULL, db.encoding = "", dsn = NULL, user = NULL, passwd = NULL)
prepareConnection(db.vendor, odbc.driver = odbc::odbc(), db.host = NULL, db.name = NULL, db.encoding = "", dsn = NULL, user = NULL, passwd = NULL)
db.vendor |
Database vendor (teradata, sqlserver) |
odbc.driver |
ODBC driver used to connect to database |
db.host |
Database hostname |
db.name |
Database name |
db.encoding |
Database encoding |
dsn |
Data source name |
user |
Username to connect to database |
passwd |
Password to connect to database |
conn.info <- prepareConnection(db.vendor = "teradata", dsn = "ODBC_MYDB", user = "myuser", passwd = "mypasswd")
conn.info <- prepareConnection(db.vendor = "teradata", dsn = "ODBC_MYDB", user = "myuser", passwd = "mypasswd")
Print method
## S3 method for class 'profile' print(x, ...)
## S3 method for class 'profile' print(x, ...)
x |
profile object |
... |
other parameters |
printed profile
profileColumn
profileColumn(conn.info, schema, table, column, column.datatype, query.filter, limit.freq.values = 30, format.show.percentage)
profileColumn(conn.info, schema, table, column, column.datatype, query.filter, limit.freq.values = 30, format.show.percentage)
conn.info |
Connection info created with |
schema |
Table schema |
table |
Table name |
column |
Column being profiled |
column.datatype |
Column datatype |
query.filter |
Filter applied before profile the column |
limit.freq.values |
Distinct values shown in frequency data frame |
format.show.percentage |
Threshold considered when showing formats' percentages |
columnProfile <- list(column, count.total, count.distinct, perc.distinct, count.null, perc.null, min.value, max.value, column.freq, format.freq = format.freq)
Profiles column based on its format, using masking strategy. X = char, 9 = digit, S = symbol
profileColumnFormat(conn.info, column, column.datatype, schema, table, count.total, query.filter, format.show.percentage)
profileColumnFormat(conn.info, column, column.datatype, schema, table, count.total, query.filter, format.show.percentage)
conn.info |
Connection info created with |
column |
Column name that will be profiled |
column.datatype |
Column datatipe |
schema |
Table schema |
table |
Table name |
count.total |
Number of rows to be profiled |
query.filter |
Filter applied to the table, when profilling |
format.show.percentage |
Threshold considered when showing formats' percentages |
Data Frame containing columns' metadata
Profiles tables and dataframes (collecting statistics and informative summaries about that data): max, min, avg, sd, nulls, distinct values, data patterns, data/format frequencies.
runProfile(conn.info, schema = NULL, table, is.parallel = TRUE, count.nodes, query.filter = NA, format.show.percentage = 0.03)
runProfile(conn.info, schema = NULL, table, is.parallel = TRUE, count.nodes, query.filter = NA, format.show.percentage = 0.03)
conn.info |
Connection info created with |
schema |
Table schema |
table |
Table name |
is.parallel |
Boolean that indicates if profile will run in parallel. Default TRUE. |
count.nodes |
Number of nodes used when is.parallel = TRUE |
query.filter |
Filter applied to the table, when profilling |
format.show.percentage |
Threshold considered when showing formats percentages |
profile results for the table/dataframe
Override summary function
## S3 method for class 'profile' summary(object, ...)
## S3 method for class 'profile' summary(object, ...)
object |
Profile object |
... |
other parameters |
data.frame with summary information