A comprehensive reference for base R programming, covering data structures, control flow, functions, I/O, statistical computing, and plotting.
# Vectors (atomic)
x <- c(1, 2, 3) # numeric
y <- c("a", "b", "c") # character
z <- c(TRUE, FALSE, TRUE) # logical
# Factor
f <- factor(c("low", "med", "high"), levels = c("low", "med", "high"), ordered = TRUE)
# Matrix
m <- matrix(1:6, nrow = 2, ncol = 3)
m[1, ] # first row
m[, 2] # second column
# List
lst <- list(name = "ali", scores = c(90, 85), passed = TRUE)
lst$name # access by name
lst[[2]] # access by position
# Data frame
df <- data.frame(
  id = 1:3,
  name = c("a", "b", "c"),
  value = c(10.5, 20.3, 30.1),
  stringsAsFactors = FALSE
)
df[df$value > 15, ] # filter rows
df$new_col <- df$value * 2 # add column
# Vectors
x[1:3] # by position
x[c(TRUE, FALSE)] # by logical
x[x > 5] # by condition
x[-1] # exclude first
# Data frames
df[1:5, ] # first 5 rows
df[, c("name", "value")] # select columns
df[df$value > 10, "name"] # filter + select
subset(df, value > 10, select = c(name, value))
# which() for index positions
idx <- which(df$value == max(df$value))
# if/else
if (x > 0) {
  "positive"
} else if (x == 0) {
  "zero"
} else {
  "negative"
}
# ifelse (vectorized)
ifelse(x > 0, "pos", "neg")
# for loop
for (i in seq_along(x)) {
  cat(i, x[i], "\n")
}
# while
while (condition) {
  # body
  if (stop_cond) break
}
# switch
switch(type,
  "a" = do_a(),
  "b" = do_b(),
  stop("Unknown type")  # unnamed last arg = default
)
# Define
my_func <- function(x, y = 1, ...) {
  result <- x + y
  return(result)  # or just: result
}
# Anonymous functions
sapply(1:5, function(x) x^2)
# R 4.1+ shorthand:
sapply(1:5, \(x) x^2)
# Useful: do.call for calling with a list of args
do.call(paste, list("a", "b", sep = "-"))
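A common do.call pattern is collapsing a list of data frames into one (a small illustrative sketch; the variable names here are made up):

```r
# do.call(fun, args) calls fun with the list elements as arguments,
# so do.call(rbind, dfs) is rbind(dfs[[1]], dfs[[2]], dfs[[3]]).
dfs <- list(
  data.frame(id = 1, x = "a"),
  data.frame(id = 2, x = "b"),
  data.frame(id = 3, x = "c")
)
combined <- do.call(rbind, dfs)
nrow(combined)  # 3
```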
# sapply - simplify result to vector/matrix
sapply(lst, length)
# lapply - always returns a list
lapply(lst, function(x) x[1])
# vapply - like sapply but with type safety
vapply(lst, length, integer(1))
# apply - over matrix margins (1 = rows, 2 = cols)
apply(m, 2, sum)
# tapply - apply by groups
tapply(df$value, df$group, mean)
# mapply - multivariate
mapply(function(x, y) x + y, 1:3, 4:6)
# aggregate - like tapply for data frames
aggregate(value ~ group, data = df, FUN = mean)
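Why vapply over sapply: sapply's simplification depends on the data, while vapply fails fast if a result doesn't match the declared template. A minimal sketch:

```r
lst <- list(a = 1:3, b = letters[1:2])
vapply(lst, length, integer(1))     # named integer vector: a = 3, b = 2
# On empty input, sapply degrades to an empty list,
# while vapply still returns the declared type:
sapply(list(), length)              # list()
vapply(list(), length, integer(1))  # integer(0) - type-stable
```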
paste("a", "b", sep = "-") # "a-b"
paste0("x", 1:3) # "x1" "x2" "x3"
sprintf("%.2f%%", 3.14159) # "3.14%"
nchar("hello") # 5
substr("hello", 1, 3) # "hel"
gsub("old", "new", text) # replace all
grep("pattern", x) # indices of matches
grepl("pattern", x) # logical vector
strsplit("a,b,c", ",") # list of one vector: c("a","b","c")
trimws(" hi ") # "hi"
tolower("ABC") # "abc"
# CSV
df <- read.csv("data.csv", stringsAsFactors = FALSE)
write.csv(df, "output.csv", row.names = FALSE)
# Tab-delimited
df <- read.delim("data.tsv")
# General
df <- read.table("data.txt", header = TRUE, sep = "\t")
# RDS (single R object, preserves types)
saveRDS(obj, "data.rds")
obj <- readRDS("data.rds")
# RData (multiple objects)
save(df1, df2, file = "data.RData")
load("data.RData")
# Connections
con <- file("big.csv", "r")
chunk <- readLines(con, n = 100)
close(con)
# Scatter
plot(x, y, main = "Title", xlab = "X", ylab = "Y",
     pch = 19, col = "steelblue", cex = 1.2)
# Line
plot(x, y, type = "l", lwd = 2, col = "red")
lines(x, y2, col = "blue", lty = 2) # add line
# Bar
barplot(table(df$category), main = "Counts",
        col = "lightblue", las = 2)
# Histogram
hist(x, breaks = 30, col = "grey80",
     main = "Distribution", xlab = "Value")
# Box plot
boxplot(value ~ group, data = df,
        col = "lightyellow", main = "By Group")
# Multiple plots
par(mfrow = c(2, 2)) # 2x2 grid
# ... four plots ...
par(mfrow = c(1, 1)) # reset
# Save to file
png("plot.png", width = 800, height = 600)
plot(x, y)
dev.off()
# Add elements
legend("topright", legend = c("A", "B"),
       col = c("red", "blue"), lty = 1)
abline(h = 0, lty = 2, col = "grey")
text(x, y, labels = names, pos = 3, cex = 0.8)
# Descriptive
mean(x); median(x); sd(x); var(x)
quantile(x, probs = c(0.25, 0.5, 0.75))
summary(df)
cor(x, y)
table(df$category) # frequency table
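Two details behind quantile() and the distribution functions, sketched: quantile() interpolates by default (type = 7), and the r*() functions use length(n) as the count when n is a vector:

```r
quantile(1:5, 0.3)            # 2.2 - type = 7 linearly interpolates
quantile(1:5, 0.3, type = 1)  # 2   - inverse empirical CDF (SAS default)
pnorm(1.96)                   # ~0.975; qnorm(0.975) inverts it
length(rnorm(c(1, 2, 3)))     # 3 - uses length(n), not sum(n)
```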
# Linear model
fit <- lm(y ~ x1 + x2, data = df)
summary(fit)
coef(fit)
predict(fit, newdata = new_df)
confint(fit)
# t-test
t.test(x, y) # two-sample
t.test(x, mu = 0) # one-sample
t.test(before, after, paired = TRUE)
# Chi-square
chisq.test(table(df$a, df$b))
# ANOVA
fit <- aov(value ~ group, data = df)
summary(fit)
TukeyHSD(fit)
# Correlation test
cor.test(x, y, method = "pearson")
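All of these test functions return an "htest" list whose components can be used programmatically. A sketch with simulated data:

```r
set.seed(42)
x <- rnorm(30, mean = 1)
res <- t.test(x, mu = 0)  # one-sample t-test
res$p.value               # numeric p-value
res$conf.int              # CI, carries attribute "conf.level"
res$estimate              # named sample mean
```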
# Merge (join)
merged <- merge(df1, df2, by = "id") # inner
merged <- merge(df1, df2, by = "id", all = TRUE) # full outer
merged <- merge(df1, df2, by = "id", all.x = TRUE) # left
# Reshape
wide <- reshape(long, direction = "wide",
                idvar = "id", timevar = "time", v.names = "value")
long <- reshape(wide, direction = "long",
                varying = list(c("v1", "v2")), v.names = "value")
# Sort
df[order(df$value), ] # ascending
df[order(-df$value), ] # descending
df[order(df$group, -df$value), ] # multi-column
# Remove duplicates
df[!duplicated(df), ]
df[!duplicated(df$id), ]
# Stack / combine
rbind(df1, df2) # stack rows (same columns)
cbind(df1, df2) # bind columns (same rows)
# Transform columns
df$log_val <- log(df$value)
df$category <- cut(df$value, breaks = c(0, 10, 20, Inf),
                   labels = c("low", "med", "high"))
ls() # list objects
rm(x) # remove object
rm(list = ls()) # clear all
str(obj) # structure
class(obj) # class
typeof(obj) # internal type
is.na(x) # check NA
complete.cases(df) # rows without NA
traceback() # after error
debug(my_func) # step through
browser() # breakpoint in code
system.time(expr) # timing
Sys.time() # current time
For deeper coverage, read the reference files in references/:

Non-obvious behaviors, surprising defaults, and tricky interactions - only what Claude doesn't already know:

- Modeling gotchas: I(), * vs :, /; aov gives the wrong SS type; glm silently fits OLS; nls won't converge; predict returns the wrong scale; optim/optimize need tuning; distribution functions use the d/p/q/r prefixes.
- Prefer vapply() over sapply() in production code - it enforces return types.
- Prefer seq_along(x) over 1:length(x) - the latter breaks when x is empty.
- Use stringsAsFactors = FALSE in read.csv() / data.frame() (default changed in R 4.0).
- Use stop(), warning(), message() for error handling - not print().
- <<- assigns to the parent environment - use sparingly and intentionally.
- with(df, expr) avoids repeating df$ everywhere.
- Sys.setenv() and .Renviron for environment variables.
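The seq_along gotcha bites in loops; a minimal demonstration:

```r
x <- numeric(0)
1:length(x)   # c(1, 0) - a two-iteration loop over an empty vector!
seq_along(x)  # integer(0) - the loop body never runs
for (i in seq_along(x)) stop("never reached")
```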
FILE: references/misc-utilities.md
Non-obvious behaviors, gotchas, and tricky defaults for R functions. Only what Claude doesn't already know.
- do.call(fun, args_list) - args must be a list, even for a single argument.
- quote = TRUE prevents evaluation of arguments before the call - needed when passing expressions/symbols.
- substitute inside do.call differs from direct calls; semantics are not fully defined for this case.
- do.call(rbind, list_of_dfs) combines a list of data frames.

R's functional programming helpers from base - genuinely non-obvious:

- Reduce(f, x) applies binary function f cumulatively: Reduce("+", 1:4) = ((1+2)+3)+4. Direction matters for non-commutative ops.
- Reduce(f, x, accumulate = TRUE) returns all intermediate results - equivalent to Python's itertools.accumulate.
- Reduce(f, x, right = TRUE) folds from the right: f(x1, f(x2, f(x3, x4))).
- Reduce with init adds a starting value: Reduce(f, x, init = v) = f(f(f(v, x1), x2), x3).
- Filter(f, x) keeps elements where f(elem) is TRUE. Unlike x[sapply(x, f)], it handles NULL/empty correctly.
- Map(f, ...) is a simple wrapper for mapply(f, ..., SIMPLIFY = FALSE) - always returns a list.
- Find(f, x) returns the first element where f(elem) is TRUE; Find(f, x, right = TRUE) for the last.
- Position(f, x) returns the index of the first match (like Find but returns position, not value).
- lengths(x) returns the length of each element of a list. Equivalent to sapply(x, length) but faster (implemented in C).

Condition handling:

- tryCatch unwinds the call stack - the handler runs in the calling environment, not where the error occurred. Cannot resume execution.
- withCallingHandlers does NOT unwind - the handler runs where the condition was signaled. Can inspect/log then let the condition propagate.
- tryCatch(expr, error = function(e) e) returns the error condition object.
- tryCatch(expr, warning = function(w) {...}) catches the first warning and exits. Use withCallingHandlers + invokeRestart("muffleWarning") to suppress warnings but continue.
- tryCatch's finally clause always runs (like Java try/finally).
- globalCallingHandlers() registers handlers that persist for the session (useful for logging).
- stop(errorCondition("msg", class = "myError")), then catch with tryCatch(..., myError = function(e) ...).

all.equal:

- Default tolerance is ~1.5e-8, i.e., sqrt(.Machine$double.eps). Returns TRUE or a character string describing the difference - NOT FALSE. Use isTRUE(all.equal(x, y)) in conditionals.
- The tolerance argument controls numeric tolerance; scale switches absolute vs relative comparison. Don't compare doubles with ==.

Other utilities:

- combn(n, m) or combn(x, m) generates all combinations of m items from x: the result has m rows; each column is one combination. The FUN argument applies a function to each combination: combn(5, 3, sum) returns sums of all 3-element subsets. simplify = FALSE returns a list instead of a matrix.
- modifyList(x, val) replaces elements of list x with those in val by name. A NULL value removes that element. It uses x[names(val)] <- val internally, so any name in val gets added or replaced.
- relist is the inverse of unlist: given a flat vector and a skeleton list, it reconstructs the nested structure. relist(flesh, skeleton) - flesh is the flat data, skeleton provides the shape.
- txtProgressBar(min, max, style = 3) - style 3 shows percentage + bar (most useful). Update with setTxtProgressBar(pb, value); close with close(pb). Styles 1 and 2 draw a simple bar; only style 3 shows the percentage.
- format(object.size(x), units = "MB") for human-readable output.
- installed.packages() can be slow (scans all packages). Use find.package() or requireNamespace() to check for a specific package.
- update.packages(ask = FALSE) updates all packages without prompting; lib.loc specifies which library to check/update.
- vignette() lists all vignettes; vignette("name", package = "pkg") opens a specific one. demo() lists all demos; demo("topic") runs one interactively. browseVignettes() opens the vignette browser in HTML.

Time series:

- ts(data, start, frequency): frequency is observations per unit time (12 for monthly, 4 for quarterly).
- acf default is type = "correlation". Use type = "partial" for PACF; plot = FALSE to suppress auto-plotting.
- arima(x, order = c(p,d,q)) for ARIMA models; seasonal = list(order = c(P,D,Q), period = S) for a seasonal component. arima handles NA values in the series (via Kalman filter).
- stl requires s.window (seasonal window) - it must be specified, no default. s.window = "periodic" assumes fixed seasonality.
- decompose: simpler than stl, uses moving averages; type = "additive" or "multiplicative".
- stl result components: $time.series matrix with columns seasonal, trend, remainder.
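A quick runnable sketch of the base functional helpers:

```r
Reduce(`+`, 1:4)                     # 10: ((1 + 2) + 3) + 4
Reduce(`+`, 1:4, accumulate = TRUE)  # 1 3 6 10 - running totals
Reduce(`-`, 1:4, right = TRUE)       # -2: 1 - (2 - (3 - 4))
Filter(function(x) x > 2, 1:5)       # 3 4 5
Map(`+`, 1:3, 4:6)                   # list(5, 7, 9) - never simplifies
```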
FILE: references/data-wrangling.md
Non-obvious behaviors, gotchas, and tricky defaults for R functions. Only what Claude doesn't already know.
Indexing pitfalls in base R.
- m[j = 2, i = 1] is m[2, 1], not m[1, 2] - argument names are ignored in [; matching is positional only. Never name index args.
- x[f] uses the integer codes of factor f, not its character labels. Use x[as.character(f)] for label-based indexing.
- x[[]] with no index is always an error. x$name does partial matching by default; x[["name"]] does not (exact by default).
- Assigning NULL via x[[i]] <- NULL or x$name <- NULL deletes that list element.
- [ with a single column: df[, 1] returns a vector (drop = TRUE default for columns), but df[1, ] returns a data frame (drop = FALSE for rows). Use drop = FALSE explicitly.
- Matrix indexing of a data frame (df[cbind(i,j)]) coerces to a matrix first - avoid.
- subset: use interactively only; unsafe for programming. Its subset argument uses non-standard evaluation - column names are resolved in the data frame, which can silently pick up the wrong variables in programmatic use. Use [ with explicit logic in functions. NAs in the logical condition are treated as FALSE (rows silently dropped). Unused factor levels are kept; call droplevels().
- %in% never returns NA - this makes it safe for if() conditions, unlike ==.
- match() returns the position of the first match only; duplicates in the table are ignored. NaN matches NaN but not NA; NA matches NA only.
- apply coerces to a matrix via as.matrix first - mixed types become character. Result dimensions are c(n, dim(X)[MARGIN]); row results become columns. ... args cannot share names with X, MARGIN, or FUN (partial matching risk).
- sapply can return a vector, matrix, or list unpredictably - use vapply in non-interactive code with an explicit FUN.VALUE template. sapply with simplify = "array" can produce higher-rank arrays (not just matrices).
- Passing a bare generic to lapply can cause dispatch issues; wrap it as function(x) is.numeric(x) rather than bare is.numeric.
- tapply: ... args to FUN are not divided into cells - they apply globally, so FUN should not expect additional args with the same length as X. default = NA fills empty cells; set default = 0 for sum-like operations (before R 3.4.0 this was hard-coded to NA). array2DF() converts the result to a data frame.
- mapply: the argument is SIMPLIFY (all caps), not simplify - inconsistent with sapply. MoreArgs must be a list of args not vectorized over.
- merge: the default by is intersect(names(x), names(y)) - can silently merge on unintended columns if data frames share column names. by = 0 or by = "row.names" merges on row names, adding a "Row.names" column. by = NULL (or both by.x/by.y length 0) produces a Cartesian product. Output is sorted on the by columns by default (sort = TRUE); for unsorted output use sort = FALSE.
- split: if f is a list of factors, interaction is used; levels containing "." can cause unexpected splits unless sep is changed. drop = FALSE (default) retains empty factor levels as empty list elements. Formula interface: split(df, ~ Month).
- cbind on data frames calls data.frame(...), not cbind.matrix. Mixing matrices and data frames can give unexpected results.
- rbind on data frames matches columns by name, not position. Missing columns get NA.
- cbind(NULL) returns NULL (not a matrix); for consistency, rbind(NULL) also returns NULL.
- table drops NAs by default (useNA = "no"). Use useNA = "ifany" or exclude = NULL to count NAs. A non-empty, non-default exclude implies useNA = "ifany". Convert with as.data.frame(tbl).
- duplicated marks the second and later occurrences as TRUE, not the first. Use fromLast = TRUE to reverse. unique keeps the first occurrence of each value.
- data.frame: stringsAsFactors = FALSE is the default since R 4.0.0 (was TRUE before). Use I() to prevent conversion. Duplicate column names survive check.names = FALSE, but many operations will de-dup them silently.
- as.numeric(f) on a factor returns integer codes, not the original values. Use as.numeric(levels(f))[f] or as.numeric(as.character(f)).
- == and != work between factors only when they have identical level sets. Ordered factors support <, >.
- c() on factors unions level sets (since R 4.1.0), but earlier versions converted to integer.
- The formula interface to aggregate (aggregate(y ~ x, data, FUN)) drops NA groups by default. Pass by as a list (not a vector).
- complete.cases accepts multiple arguments (complete.cases(x, y) checks both).
- order: use x[order(x)] to sort; -x for descending numeric, or decreasing = TRUE. method = "radix" gives locale-independent fast sorting; sort.int() with method = "radix" is much faster for large integer/character vectors.
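Two of these gotchas in runnable form (toy data):

```r
df <- data.frame(a = 1:3, b = 4:6)
class(df[, 1])                # "integer" - a single column drops to a vector
class(df[, 1, drop = FALSE])  # "data.frame"
f <- factor(c("10", "20", "10"))
as.numeric(f)                 # 1 2 1 - the integer codes, not the labels
as.numeric(as.character(f))   # 10 20 10
```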
FILE: references/dates-and-system.md
Non-obvious behaviors, gotchas, and tricky defaults for R functions. Only what Claude doesn't already know.
- Date objects are stored as integer days since 1970-01-01. Arithmetic works in days.
- Sys.Date() returns the current date as a Date object.
- seq.Date(from, to, by = "month") - "month" increments can produce varying-length intervals. Adding 1 month to Jan 31 gives Mar 3 (not Feb 28).
- diff(dates) returns a difftime object in days.
- format(date, "%Y") for year, "%m" for month, "%d" for day, "%A" for weekday name (locale-dependent).
- length(date_vector) <- n pads with NAs if extended.
- POSIXct: seconds since 1970-01-01 UTC (compact, a numeric vector).
- POSIXlt: list with components $sec, $min, $hour, $mday, $mon (0-11!), $year (since 1900!), $wday (0-6, Sunday = 0), $yday (0-365).
- as.Date(posixct_obj) uses tz = "UTC" by default - may give a different date than intended if the original was in another timezone.
- Sys.time() returns POSIXct in the current timezone.
- strptime returns POSIXlt; use as.POSIXct(strptime(...)) to get POSIXct.
- difftime arithmetic: subtracting POSIXct objects gives a difftime; units are auto-selected. difftime(time1, time2, units = "auto") picks the smallest sensible unit. Valid units: "secs", "mins", "hours", "days", "weeks" - no "months" or "years" (variable length).
- as.numeric(diff, units = "hours") extracts the numeric value in specific units; units(diff_obj) <- "hours" changes the unit in place.
- system.time(expr) returns user, system, and elapsed time. gcFirst = TRUE (default) runs garbage collection before timing for more consistent results.
- proc.time() returns cumulative time since R started - take differences for intervals. elapsed (wall clock) can be less than user (multi-threaded BLAS) or more (I/O waits).
- Sys.sleep(seconds) allows fractional seconds. Actual sleep may be longer (OS scheduling).

Selected non-obvious options:

- options(scipen = n): positive biases toward fixed notation, negative toward scientific. Default 0. Applies to print/format/cat but not sprintf.
- options(digits = n): significant digits for printing (1-22, default 7). A suggestion only.
- options(digits.secs = n): max decimal digits for seconds in time formatting (0-6, default 0).
- options(warn = n): -1 = ignore warnings, 0 = collect (default), 1 = immediate, 2 = convert to errors.
- options(error = recover): drop into the debugger on error. options(error = NULL) resets to default.
- options(OutDec = ","): change the decimal separator in output (affects format, print, NOT sprintf).
- options(stringsAsFactors = FALSE): global default for data.frame (moot since R 4.0.0, where it's already FALSE).
- options(expressions = 5000): max nested evaluations. Increase for deep recursion.
- options(max.print = 99999): controls truncation in print output.
- options(na.action = "na.omit"): default NA handling in model functions.
- options(contrasts = c("contr.treatment", "contr.poly")): default contrasts for unordered/ordered factors.

File system:

- file.path("a", "b", "c.txt") gives "a/b/c.txt" (platform-appropriate separator). basename("/a/b/c.txt") gives "c.txt"; dirname("/a/b/c.txt") gives "/a/b".
- file.path does NOT normalize paths (no .. resolution); use normalizePath() for that.
- list.files(pattern = "*.csv") - pattern is a regex, not a glob! Use glob2rx("*.csv") or "\\.csv$". full.names = FALSE (default) returns basenames only; use full.names = TRUE for complete paths. recursive = TRUE searches subdirectories; all.files = TRUE includes hidden files (starting with .).
- file.info returns size, isdir, mode, mtime, ctime, atime, uid, gid. mtime is modification time (POSIXct) - useful as file.info(f)$mtime. ctime is status-change time, not creation time.
- file_test("-f", path): TRUE if a regular file exists. file_test("-d", path): TRUE for a directory. file_test("-nt", f1, f2): TRUE if f1 is newer than f2. Sharper than file.exists() for distinguishing files from directories.
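The month-rollover and difftime points, demonstrated:

```r
seq(as.Date("2023-01-31"), by = "month", length.out = 2)
# "2023-01-31" "2023-03-03" - Jan 31 + 1 month overflows February
d <- as.Date("2023-03-10") - as.Date("2023-03-01")
class(d)                        # "difftime", in days
as.numeric(d, units = "hours")  # 216
```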
FILE: references/io-and-text.md
Non-obvious behaviors, gotchas, and tricky defaults for R functions. Only what Claude doesn't already know.
read.table family:

- sep = "" (default) means any whitespace (spaces, tabs, newlines) - not a literal empty string.
- comment.char = "#" by default - lines with # are truncated. Use comment.char = "" to disable (also faster).
- header auto-detection: set to TRUE if the first row has one fewer field than subsequent rows (the missing field is assumed to be row names).
- colClasses = "NULL" skips that column entirely - very useful for speed.
- read.csv defaults differ from read.table: header = TRUE, sep = ",", fill = TRUE, comment.char = "".
- Setting colClasses and nrows dramatically reduces memory usage. read.table is slow for wide data frames (hundreds of columns); use scan or data.table::fread for matrices.
- stringsAsFactors = FALSE since R 4.0.0 (was TRUE before).

write.csv / write.table:

- row.names = TRUE by default - produces an unnamed first column that confuses re-reading. Use row.names = FALSE or col.names = NA for Excel-compatible CSV.
- write.csv fixes sep = ",", dec = ".", and uses qmethod = "double" - these cannot be overridden via ....
- quote = TRUE (default) quotes character/factor columns. Numeric columns are never quoted.

read.fwf:

- widths is a vector of field widths. buffersize controls how many lines are read at a time; increase it for large files. Calls read.table internally after splitting fields; sep and quote match read.table.

Regex functions:

- Regex behavior differs with perl = TRUE and fixed = TRUE for edge cases.
- Arguments after x/pattern are matched positionally to ignore.case, perl, etc. - a common source of silent bugs.
- sub replaces the first match only; gsub replaces all matches.
- Backreferences: "\\1" in the replacement (double backslash in R strings). With perl = TRUE: "\\U\\1" for uppercase conversion.
- grep(value = TRUE) returns matching elements; grep(value = FALSE) (default) returns indices. grepl returns a logical vector - preferred for filtering.
- regexpr returns the first match position + length (as attributes); gregexpr returns all matches as a list. regexec returns match + capture group positions; gregexec does this for all matches.
- [:alpha:] must be inside [[:alpha:]] (double brackets) in POSIX mode.
- strsplit: split = "" or split = character(0) splits into individual characters. A match at the start yields a leading ""; a match at the end yields no trailing "". fixed = TRUE is faster and avoids regex interpretation.
- substr(x, start, stop): extracts/replaces a substring; 1-indexed, inclusive on both ends. substring(x, first, last) is the same, but last defaults to 1000000L (effectively "to end") and is vectorized over first/last. substr(x, 1, 3) <- "abc" replaces in place (must be a same-length replacement).
- trimws: which = "both" (default), "left", or "right". whitespace = "[ \\t\\r\\n]" - a customizable regex for what counts as whitespace.
- nchar: type = "bytes" counts bytes; type = "chars" (default) counts characters; type = "width" counts display width. nchar(NA) returns NA (not 2). nchar(factor) works on the level labels. keepNA = TRUE (default since R 3.3.0); set FALSE to count "NA" as 2 characters.
- format(x, digits, nsmall): nsmall forces minimum decimal places; big.mark = "," adds a thousands separator. formatC(x, format = "f", digits = 2) gives C-style formatting; format = "e" for scientific, "g" for general. format returns a character vector, right-justified by default (justify = "right").
- type.convert: as.is = TRUE (recommended) keeps characters as character, not factor. tryLogical = TRUE (R 4.3+) converts "TRUE"/"FALSE" columns.

Rscript:

- commandArgs(trailingOnly = TRUE) gets script arguments (excluding R/Rscript flags).
- #! line on Unix: /usr/bin/env Rscript or a full path. Use --vanilla or --no-init-file to skip .Rprofile loading. quit(status = 1) for an error exit. Output via cat, print, or any expression that writes to stdout.
- capture.output: file = NULL (default) returns a character vector; file = "out.txt" writes directly to a file; type = "message" captures stderr instead.
- URLencode(url, reserved = FALSE) by default does NOT encode reserved chars (/, ?, &, etc.). Use reserved = TRUE to encode a URL component (query parameter value).
- glob2rx("*.csv") gives "^.*\\.csv$". Example: list.files(pattern = glob2rx("data_*.RDS")).
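The glob-vs-regex and sub-vs-gsub points in runnable form:

```r
glob2rx("*.csv")                  # "^.*\\.csv$" - converts glob to regex
grepl(glob2rx("*.csv"), "a.csv")  # TRUE
sub("a", "X", "banana")           # "bXnana" - first match only
gsub("a", "X", "banana")          # "bXnXnX" - all matches
gsub("(\\w+)@.*", "\\1", "user@example.com")  # "user" via backreference
```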
FILE: references/modeling.md
Non-obvious behaviors, gotchas, and tricky defaults for R functions. Only what Claude doesn't already know.
Symbolic model specification gotchas.
- I() is required to use arithmetic operators literally: y ~ x + I(x^2). Without I(), ^ means interaction crossing.
- * = main effects + interaction: a*b expands to a + b + a:b.
- (a+b+c)^2 = all main effects + all 2-way interactions (not squaring).
- "-" removes terms: (a+b+c)^2 - a:b drops only the a:b interaction.
- / means nesting: a/b = a + b %in% a = a + a:b.
- . in a formula means "all other columns in data" (in terms.formula context) or "previous contents" (in update.formula).
- as.formula("y ~ x") uses parent.frame() as the formula environment.
- model.matrix creates the design matrix including dummy coding. Default contrasts: contr.treatment for unordered factors, contr.poly for ordered. Column names can be surprising: factor dummies are named by factorLevelName concatenation.
- terms object attributes: order (interaction order per term), intercept, factors matrix.

glm:

- Default family = gaussian(link = "identity") - glm() with no family silently fits OLS (same as lm, but slower and with deviance-based output).
- Other families: binomial(link = "logit"), poisson(link = "log"), Gamma(link = "inverse"), inverse.gaussian().
- binomial accepts the response as: 0/1 vector, logical, factor (second level = success), or a 2-column matrix cbind(success, failure).
- weights in glm means prior weights (not frequency weights) - for frequency weights, use the cbind trick or offset.
- predict.glm(type = "response") for predicted probabilities; default type = "link" returns log-odds (logistic) or log-rate (Poisson).
- anova(glm_obj, test = "Chisq") for deviance-based tests; "F" is invalid for non-Gaussian families.
- Quasi families (quasibinomial, quasipoisson) allow overdispersion - no AIC is computed.
- control = glm.control(maxit = 100) if the default 25 iterations isn't enough.

aov:

- aov is a wrapper around lm that stores extra info for balanced ANOVA. For unbalanced designs, Type I SS (sequential) are computed - order of terms matters. For other SS types, use car::Anova() or set contrasts to contr.sum/contr.helmert.
- Within-subject designs: aov(y ~ A*B + Error(Subject/B)).
- summary.aov gives the ANOVA table; summary.lm(aov_obj) gives a regression-style summary.

nls:

- Supply start = list(...) or convergence fails. Self-starting models (SSlogis, SSasymp, etc.) auto-compute starting values.
- algorithm = "port" allows bounds on parameters (lower/upper).
- Zero-residual (artificial exact-fit) data breaks convergence; use control = list(scaleOffset = 1) or jitter the data.
- weights argument for weighted NLS; na.action for missing-value handling.

step:

- step does stepwise model selection by AIC (default). Use k = log(n) for BIC. direction = "both" (default), "forward", or "backward".
- add1/drop1 evaluate single-term additions/deletions; step calls these iteratively. The scope argument defines the upper/lower model bounds for the search.
- step modifies the model object in place - can be slow for large models with many candidate terms.

predict:

- predict.lm with interval = "confidence" gives a CI for the mean response; interval = "prediction" gives a PI for a new observation (wider).
- newdata must have columns matching the original formula variables - factors must have the same levels.
- predict.glm with type = "response" gives predictions on the response scale (e.g., probabilities for logistic); type = "link" (default) gives the link scale. se.fit = TRUE returns standard errors; for predict.glm these are on the link scale regardless of type.
- predict.lm with type = "terms" returns the contribution of each term.

loess / lowess / smooth.spline:

- span controls smoothness (default 0.75). Span < 1 uses that proportion of points; span > 1 uses all points with adjusted distance.
- degree = 0 (local constant) is allowed but poorly tested - use with caution.
- Only a few numeric predictors are supported by loess; conditioning is not implemented.
- normalize = TRUE (default) standardizes predictors to a common scale; set FALSE for spatial coords.
- lowess is the older function; it returns list(x, y) - cannot predict at new points. loess is the newer formula interface with a predict method.
- lowess's span parameter is f (default 2/3); loess's is span (default 0.75). lowess iter default is 3 (robustifying iterations); loess default family = "gaussian" (no robustness).
- smooth.spline: cv = TRUE uses ordinary leave-one-out CV instead of the default GCV - do not use with duplicate x values. spar and lambda control smoothness; df can specify equivalent degrees of freedom. Returns an object with predict, print, plot methods; the fit component has knots and coefficients.

optim and friends:

- Maximize by setting control = list(fnscale = -1).
- For 1D problems use method = "Brent" or optimize().
- "L-BFGS-B" is the only method supporting box constraints (lower/upper). Supplying bounds auto-selects this method with a warning.
- "SANN" (simulated annealing): the convergence code is always 0 - it never "fails". maxit = total function evals (default 10000), no other stopping criterion.
- parscale: scale parameters so a unit change in each produces comparable objective change. Critical for mixed-scale problems.
- hessian = TRUE returns the numerical Hessian of the unconstrained problem even if box constraints are active.
- fn can return NA/Inf (except "L-BFGS-B", which always requires finite values). The initial value must be finite.
- optimize: 1D minimization on a bounded interval. Returns minimum and objective.
- uniroot finds a root of f in [lower, upper]; requires f(lower) and f(upper) to have opposite signs. extendInt = "yes" can auto-extend the interval to find a sign change - but can find spurious roots for functions that don't actually cross zero.
- nlm: Newton-type minimizer. Gradient/Hessian are supplied as attributes of the return value from fn (an unusual interface).

TukeyHSD / anova:

- TukeyHSD requires an aov object (not lm). Default conf.level = 0.95. Returns adjusted p-values and confidence intervals for all pairwise comparisons.
- anova(model): sequential (Type I) SS - order of terms matters. anova(model1, model2): F-test comparing nested models. For other SS types, use car::Anova().
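The link-vs-response distinction for predict.glm, sketched with simulated data:

```r
set.seed(1)
d <- data.frame(x = rnorm(200))
d$y <- rbinom(200, 1, plogis(-0.5 + d$x))   # simulate a logistic outcome
fit <- glm(y ~ x, data = d, family = binomial)
nd  <- data.frame(x = 0)
eta <- predict(fit, newdata = nd)                     # log-odds (link scale)
p   <- predict(fit, newdata = nd, type = "response")  # probability
all.equal(unname(plogis(eta)), unname(p))             # TRUE
```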
FILE: references/statistics.md
Non-obvious behaviors, gotchas, and tricky defaults for R functions. Only what Claude doesn't already know.
correct = TRUE (default) applies Yates continuity correction for 2x2 tables only.simulate.p.value = TRUE: Monte Carlo with B = 2000 replicates (min p ~ 0.0005). Simulation assumes fixed marginals (Fisher-style sampling, not the chi-sq assumption).p must sum to 1 (or set rescale.p = TRUE).$expected, $residuals (Pearson), and $stdres (standardized).exact = TRUE by default for small samples with no ties. With ties, normal approximation used.correct = TRUE applies continuity correction to normal approximation.conf.int = TRUE computes Hodges-Lehmann estimator and confidence interval (not just the p-value).paired = TRUE uses signed-rank test (Wilcoxon), not rank-sum (Mann-Whitney).simulate.p.value = TRUE) or network algorithm.workspace controls memory for the network algorithm; increase if you get errors on large tables.or argument tests a specific odds ratio (default 1) β only for 2x2 tables.dgof or ks.test with exact = NULL for discrete distributions."holm" (default), "BH" (Benjamini-Hochberg FDR), "bonferroni", "BY", "hochberg", "hommel", "fdr" (alias for BH), "none".n argument: total number of hypotheses (can be larger than length(p) if some p-values are excluded).NAs: adjusted p-values are NA where input is NA.p.adjust.method defaults to "holm". Change to "BH" for FDR control.pool.sd = TRUE (default for t-test): uses pooled SD across all groups (assumes equal variances).nstart > 1 recommended (e.g., nstart = 25): runs algorithm from multiple random starts, returns best.iter.max = 10 β may be too low for convergence. Increase for large/complex data.ifault = 4).method = "ward.D2" implements Ward's criterion correctly (using squared distances). The older "ward.D" did not square distances (retained for back-compatibility).dist object. Use as.dist() to convert a symmetric matrix.hang = -1 in plot() aligns all labels at the bottom.method = "euclidean" (default). Other options: "manhattan", "maximum", "canberra", "binary", "minkowski".dist object (lower triangle only). 
Use as.matrix() to get full matrix."canberra": terms with zero numerator and denominator are omitted from the sum (not treated as 0/0).Inf values: Euclidean distance involving Inf is Inf. Multiple Infs in same obs give NaN for some methods.prcomp uses SVD (numerically superior); princomp uses eigen on covariance (less stable, N-1 vs N scaling).scale. = TRUE in prcomp standardizes variables; important when variables have very different scales.princomp standard deviations differ from prcomp by factor sqrt((n-1)/n).$rotation (loadings) and $x (scores); sign of components may differ between runs.bw = "nrd0" (Silverman's rule of thumb). For multimodal data, consider "SJ" or "bcv".adjust: multiplicative factor on bandwidth. adjust = 0.5 halves the bandwidth (less smooth)."gaussian". Range of density extends beyond data range (controlled by cut, default 3 bandwidths).n = 512: number of evaluation points. Increase for smoother plotting.from/to: explicitly bound the evaluation range.type options (1-9). Default type = 7 (R default, linear interpolation). Type 1 = inverse of empirical CDF (SAS default). Types 4-9 are continuous; 1-3 are discontinuous.na.rm = FALSE by default β returns NA if any NAs present.names = TRUE by default, adding "0%", "25%", etc. as names.All distribution functions follow the d/p/q/r pattern. Common non-obvious points:
- `n` argument in `r*()` functions: if `length(n) > 1`, the *length* of `n` is used as the count, not `n` itself. So `rnorm(c(1, 2, 3))` generates 3 values, not 1+2+3.
- `log = TRUE` / `log.p = TRUE`: compute on the log scale for numerical stability in the tails.
- `lower.tail = FALSE` gives the survival function P(X > x) directly (more accurate than `1 - pnorm()` in the tails).
- Gamma: `shape` and `rate` (= 1/scale). Default `rate = 1`. Specifying both `rate` and `scale` is an error.
- Beta: `shape1` (alpha), `shape2` (beta) -- no mean/sd parameterization.
- `dpois`: `x` can be non-integer (returns 0 with a warning for non-integer values if `log = FALSE`).
- Weibull: `shape` and `scale` (no rate). R's parameterization: f(x) = (shape/scale)(x/scale)^(shape-1) exp(-(x/scale)^shape).
- Lognormal: `meanlog` and `sdlog` are the mean/sd of the log, not of the distribution itself.

### cor.test
- Default method is `"pearson"`. Also `"kendall"` and `"spearman"`.
- Results: `$estimate`, `$p.value`, `$conf.int` (CI only for Pearson).
- Formula interface: `cor.test(~ x + y, data = df)` -- note the `~` with no LHS.

### ecdf
- Returns a function: `Fn <- ecdf(x); Fn(3.5)`.
- `plot(ecdf(x))` gives the empirical CDF plot.
- NA in weights: the observation is dropped if its weight is NA.

Non-obvious behaviors, gotchas, and tricky defaults for R functions. Only what Claude doesn't already know.
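Two of the tail/argument gotchas above, shown directly:

```r
set.seed(1)
length(rnorm(c(1, 2, 3)))        # 3 draws -- length(n) is used, not sum(n)

# Survival function directly vs. 1 - pnorm(): the latter underflows in the tail
pnorm(10, lower.tail = FALSE)    # tiny but nonzero
1 - pnorm(10)                    # exactly 0 in double precision
```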
### par
- `par()` settings are per-device. Opening a new device resets everything.
- `mfrow`/`mfcol` resets `cex` to 1 and `mex` to 1. With a 2x2 layout, base `cex` is multiplied by 0.83; with 3+ rows/columns, by 0.66.
- `mai` (inches), `mar` (lines), `pin`, `plt`, `pty` all interact. Restoring all saved parameters after a device resize can produce inconsistent results -- last-alphabetically wins.
- Setting `bg` via `par()` also sets `new = FALSE`. Setting `fg` via `par()` also sets `col`.
- `xpd = NA` clips to the device region (allows drawing in outer margins); `xpd = TRUE` clips to the figure region; `xpd = FALSE` (default) clips to the plot region.
- `mgp = c(3, 1, 0)`: controls the title line (`mgp[1]`), label line (`mgp[2]`), axis line (`mgp[3]`). All in `mex` units.
- `las`: 0 = parallel to axis, 1 = horizontal, 2 = perpendicular, 3 = vertical. Does not respond to `srt`.
- `tck = 1` draws grid lines across the plot. `tcl = -0.5` (default) gives outward ticks.
- `usr` with a log scale: contains log10 of the coordinate limits, not the raw values.
- Read-only: `cin`, `cra`, `csi`, `cxy`, `din`, `page`.

### layout
- `layout(mat)` where `mat` is a matrix of integers specifying the figure arrangement.
- `widths`/`heights` accept `lcm()` for absolute sizes mixed with relative sizes.
- Overrides `mfrow`/`mfcol` but cannot be queried once set (unlike `par("mfrow")`).
- `layout.show(n)` visualizes the layout for debugging.

### axis / mtext
- `axis(side, at, labels)`: side 1 = bottom, 2 = left, 3 = top, 4 = right.
- Label placement follows `par("mgp")`. Labels can overlap if not managed.
- `mtext`: the `line` argument positions text in margin lines (0 = adjacent to plot, positive = outward). `adj` controls horizontal position (0-1).
- `mtext` with `outer = TRUE` writes in the outer margin (set by `par(oma = ...)`).

### curve
- Accepts an expression in `x` or a function: `curve(sin, 0, 2*pi)` or `curve(x^2 + 1, 0, 10)`.
- `add = TRUE` to overlay on an existing plot. Default `n = 101` evaluation points.
- `xname = "x"` by default; change it if your expression uses a different variable name.

### pairs
- The `panel` function receives `(x, y, ...)` for each pair.
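A small sketch of the save-and-restore pattern for `par()` (drawn to a throwaway PDF device so it runs non-interactively):

```r
f <- tempfile(fileext = ".pdf")
pdf(f)                                    # new device: fresh par() settings
op <- par(mfrow = c(2, 2), mar = c(4, 4, 2, 1))  # note: mfrow also resets cex
hist(rnorm(100))
plot(1:10)
boxplot(rnorm(50))
plot(density(rnorm(100)))
par(op)                                    # restore the previous settings
dev.off()
```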
- `lower.panel`, `upper.panel`, `diag.panel` for different regions.
- `gap` controls spacing between panels (default 1).
- Formula interface: `pairs(~ var1 + var2 + var3, data = df)`.

### coplot
- `coplot(y ~ x | a)` or `coplot(y ~ x | a * b)` for two conditioning variables.
- The `panel` function can be customized; `rows`/`columns` control the layout.
- `panel = panel.smooth` for a loess overlay.

### matplot
- Cycles `col`, `lty`, `pch` across columns.
- `type = "l"` by default (unlike `plot`, which defaults to `"p"`).

### contour / image / persp
- `contour(x, y, z)`: `z` must be a matrix with `dim = c(length(x), length(y))`.
- `filled.contour` has a non-standard layout -- it creates its own plot region for the color key. It cannot be used with `par(mfrow)`. Adding elements requires the `plot.axes` argument.
- `image`: plots z-values as colored rectangles. The default color scheme may be misleading; set `col` explicitly.
- In `image`, `x` and `y` specify cell boundaries or midpoints depending on context.
- `persp(x, y, z, theta, phi)`: `theta` = azimuthal angle, `phi` = colatitude.
- Use `trans3d()` to add points/lines to the perspective plot.
- `shade` and `col` control surface shading. `border = NA` removes grid lines.

### arrows / polygon / rect
- `arrows`: `code = 1` (head at start), `code = 2` (head at end, default), `code = 3` (both).
- `polygon`: the last point auto-connects to the first. Fill with `col`; `border` controls the outline.
- `rect(xleft, ybottom, xright, ytop)` -- note the argument order is not the same as in other systems.

### Devices
- `dev.new()` opens a new device. `dev.off()` closes the current device (and flushes output for file devices like `pdf`).
- `dev.off()` on the last open device reverts to the null device.
- `dev.copy(pdf, file = "plot.pdf")` followed by `dev.off()` to save the current plot.
- `dev.list()` returns all open devices; `dev.cur()` the active one.
- File devices need `dev.off()` to finalize the file. Without it, the file may be empty/corrupt.
- `onefile = TRUE` (default): multiple pages in one PDF. `onefile = FALSE`: one file per page (uses `%d` in the filename for numbering).
- `useDingbats = FALSE` recommended to avoid issues with certain PDF viewers and `pch` symbols.
- `family` controls the font family.

### png
- `res` controls DPI (default 72).
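The device lifecycle above in miniature -- `dev.off()` is what finalizes a file device (the temp-file path is just for illustration):

```r
f <- tempfile(fileext = ".pdf")
pdf(f, width = 6, height = 4, useDingbats = FALSE)
plot(mtcars$wt, mtcars$mpg, pch = 19,
     xlab = "Weight (1000 lbs)", ylab = "MPG")
dev.off()   # without this, the PDF may be empty or corrupt
file.exists(f)
```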
- For publication: `res = 300` with appropriate `width`/`height` in pixels or inches (with `units = "in"`).
- `type = "cairo"` (on systems with cairo) gives better antialiasing than the default.
- `bg = "transparent"` for a transparent background (PNG supports alpha).

### Colors
- `colors()` returns all 657 named colors. `col2rgb("color")` returns an RGB matrix.
- `rgb(r, g, b, alpha, maxColorValue = 255)` -- note the `maxColorValue` default is 1, not 255.
- `hcl(h, c, l)`: perceptually uniform color space. Preferred for color scales.
- `adjustcolor(col, alpha.f = 0.5)`: easy way to add transparency.
- `colorRamp` returns a function mapping [0, 1] to an RGB matrix.
- `colorRampPalette` returns a function taking `n` and returning `n` interpolated colors.
- `space = "Lab"` gives more perceptually uniform interpolation than `"rgb"`.
- `palette()` returns the current palette (default 8 colors). `palette("Set1")` sets a built-in palette.
- `recordPlot()` / `replayPlot()`: save and restore a complete plot -- device-dependent and fragile across sessions.
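The color gotchas above, demonstrated (steelblue's RGB triple is 70, 130, 180):

```r
# maxColorValue defaults to 1 -- pass 255 explicitly for 0-255 inputs
rgb(70, 130, 180, maxColorValue = 255)    # "#4682B4" (steelblue)
adjustcolor("steelblue", alpha.f = 0.5)   # same color at 50% opacity

# colorRampPalette returns a *function*; Lab space interpolates more evenly
pal <- colorRampPalette(c("white", "steelblue"), space = "Lab")
pal(5)   # five interpolated hex colors
```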
FILE:assets/analysis_template.R
rm(list = ls())
set.seed(42)
df <- read.csv("your_data.csv", stringsAsFactors = FALSE)
dim(df)
str(df)
head(df, 10)
summary(df)
na_report <- data.frame(
  column = names(df),
  n_miss = colSums(is.na(df)),
  pct_miss = round(colMeans(is.na(df)) * 100, 1),
  row.names = NULL
)
print(na_report[na_report$n_miss > 0, ])
n_dup <- sum(duplicated(df))
cat(sprintf("Duplicate rows: %d\n", n_dup))
cat_cols <- names(df)[sapply(df, function(x) is.character(x) | is.factor(x))]
for (col in cat_cols) {
  cat(sprintf("\n%s (%d unique):\n", col, length(unique(df[[col]]))))
  print(table(df[[col]], useNA = "ifany"))
}
num_cols <- names(df)[sapply(df, is.numeric)]
round(sapply(df[num_cols], function(x) c(
  n = sum(!is.na(x)),
  mean = mean(x, na.rm = TRUE),
  sd = sd(x, na.rm = TRUE),
  median = median(x, na.rm = TRUE),
  min = min(x, na.rm = TRUE),
  max = max(x, na.rm = TRUE)
)), 3)
par(mfrow = c(2, 2))
hist(df$outcome_var, main = "Distribution of Outcome", xlab = "Outcome", col = "steelblue", border = "white", breaks = 30)
boxplot(outcome_var ~ group_var, data = df, main = "Outcome by Group", col = "lightyellow", las = 2)
plot(df$predictor, df$outcome_var,
     main = "Predictor vs Outcome", xlab = "Predictor", ylab = "Outcome",
     pch = 19, col = adjustcolor("steelblue", alpha.f = 0.5), cex = 0.8)
abline(lm(outcome_var ~ predictor, data = df), col = "red", lwd = 2)
cor_mat <- cor(df[num_cols], use = "complete.obs")
image(cor_mat, main = "Correlation Matrix", col = hcl.colors(20, "RdBu", rev = TRUE))
par(mfrow = c(1, 1))
t.test(outcome_var ~ group_var, data = df)
fit <- lm(outcome_var ~ predictor1 + predictor2 + group_var, data = df)
summary(fit)
confint(fit)
par(mfrow = c(2, 2))
plot(fit)
par(mfrow = c(1, 1))
shapiro.test(residuals(fit))
FILE:scripts/check_data.R
check_data <- function(df, top_n_levels = 8) {
  if (!is.data.frame(df)) stop("Input must be a data frame.")
  n_row <- nrow(df)
  n_col <- ncol(df)

  cat("──────────────────────────────────────────\n")
  cat("           DATA QUALITY REPORT\n")
  cat("──────────────────────────────────────────\n")
  cat(sprintf("  Rows: %d   Columns: %d\n", n_row, n_col))
  cat("──────────────────────────────────────────\n\n")

  cat("── COLUMN OVERVIEW ────────────────────────\n")
  for (col in names(df)) {
    x <- df[[col]]
    cls <- class(x)[1]
    n_na <- sum(is.na(x))
    pct <- round(n_na / n_row * 100, 1)
    n_uniq <- length(unique(x[!is.na(x)]))
    na_flag <- if (n_na == 0) "" else sprintf("  *** %d NAs (%.1f%%)", n_na, pct)
    cat(sprintf("  %-20s %-12s %d unique%s\n",
                col, cls, n_uniq, na_flag))
  }

  cat("\n── NA SUMMARY ─────────────────────────────\n")
  na_counts <- sapply(df, function(x) sum(is.na(x)))
  cols_with_na <- na_counts[na_counts > 0]
  if (length(cols_with_na) == 0) {
    cat("  No missing values.\n")
  } else {
    cat(sprintf("  Columns with NAs: %d of %d\n\n", length(cols_with_na), n_col))
    for (col in names(cols_with_na)) {
      bar_len <- round(cols_with_na[col] / n_row * 20)
      bar <- paste0(rep("█", bar_len), collapse = "")
      pct_na <- round(cols_with_na[col] / n_row * 100, 1)
      cat(sprintf("  %-20s [%-20s] %d (%.1f%%)\n",
                  col, bar, cols_with_na[col], pct_na))
    }
  }

  num_cols <- names(df)[sapply(df, is.numeric)]
  if (length(num_cols) > 0) {
    cat("\n── NUMERIC COLUMNS ────────────────────────\n")
    cat(sprintf("  %-20s %8s %8s %8s %8s %8s\n",
                "Column", "Min", "Mean", "Median", "Max", "SD"))
    cat(sprintf("  %-20s %8s %8s %8s %8s %8s\n",
                "──────", "───", "────", "──────", "───", "──"))
    for (col in num_cols) {
      x <- df[[col]][!is.na(df[[col]])]
      if (length(x) == 0) next
      cat(sprintf("  %-20s %8.3g %8.3g %8.3g %8.3g %8.3g\n",
                  col, min(x), mean(x), median(x), max(x), sd(x)))
    }
  }

  cat_cols <- names(df)[sapply(df, function(x) is.factor(x) | is.character(x))]
  if (length(cat_cols) > 0) {
    cat("\n── CATEGORICAL COLUMNS ────────────────────\n")
    for (col in cat_cols) {
      x <- df[[col]]
      tbl <- sort(table(x, useNA = "no"), decreasing = TRUE)
      n_lv <- length(tbl)
      cat(sprintf("\n  %s (%d unique values)\n", col, n_lv))
      show <- min(top_n_levels, n_lv)
      for (i in seq_len(show)) {
        lbl <- names(tbl)[i]
        cnt <- tbl[i]
        pct <- round(cnt / n_row * 100, 1)
        cat(sprintf("    %-25s %5d (%.1f%%)\n", lbl, cnt, pct))
      }
      if (n_lv > top_n_levels) {
        cat(sprintf("    ... and %d more levels\n", n_lv - top_n_levels))
      }
    }
  }

  cat("\n── DUPLICATES ─────────────────────────────\n")
  n_dup <- sum(duplicated(df))
  if (n_dup == 0) {
    cat("  No duplicate rows.\n")
  } else {
    cat(sprintf("  %d duplicate row(s) found (%.1f%% of data)\n",
                n_dup, n_dup / n_row * 100))
  }

  cat("\n──────────────────────────────────────────\n")
  cat("             END OF REPORT\n")
  cat("──────────────────────────────────────────\n")

  invisible(list(
    dims = c(rows = n_row, cols = n_col),
    na_counts = na_counts,
    n_dupes = n_dup
  ))
}
FILE:scripts/scaffold_analysis.R
#!/usr/bin/env Rscript
scaffold_analysis <- function(project_name, outcome = "outcome", group = "group", data_file = NULL) {
if (is.null(data_file)) data_file <- paste0(project_name, ".csv")
out_file <- paste0(project_name, "_analysis.R")
template <- sprintf( '# ============================================================
df <- read.csv("%s", stringsAsFactors = FALSE)
cat("Dimensions:", dim(df), "\n")
str(df)
head(df)
summary(df)
na_counts <- colSums(is.na(df))
na_counts[na_counts > 0]
hist(df$%s, main = "Distribution of %s", xlab = "%s")
if ("%s" %%in%% names(df)) {
  table(df$%s)
  barplot(table(df$%s), main = "Counts by %s", col = "steelblue", las = 2)
}
tapply(df$%s, df$%s, mean, na.rm = TRUE)
tapply(df$%s, df$%s, sd, na.rm = TRUE)
fit <- lm(%s ~ %s, data = df)
summary(fit)
confint(fit)
par(mfrow = c(1, 2))
boxplot(%s ~ %s, data = df, main = "%s by %s", xlab = "%s", ylab = "%s", col = "lightyellow")
plot(fit, which = 1) # residuals vs fitted
par(mfrow = c(1, 1))
',
  project_name, format(Sys.Date(), "%Y-%m-%d"), data_file,
  # Section 2 -- EDA
  outcome, outcome, outcome, group, group, group, group,
  # Section 3
  group, group,
  # Section 4
  outcome, group, outcome, group, outcome, group,
  outcome, group, outcome, group, outcome, group,
  # Section 5
  outcome, group, outcome, group, group, outcome,
  # Section 6
  project_name, project_name, project_name, outcome, group
)
writeLines(template, out_file)
cat(sprintf("Created: %s\n", out_file))
invisible(out_file)
}
if (!interactive()) {
  args <- commandArgs(trailingOnly = TRUE)
  if (length(args) == 0) {
    cat("Usage: Rscript scaffold_analysis.R <project_name> [outcome_var] [group_var]\n")
    cat("Example: Rscript scaffold_analysis.R myproject score treatment\n")
    quit(status = 1)
  }
  project <- args[1]
  outcome <- if (length(args) >= 2) args[2] else "outcome"
  group <- if (length(args) >= 3) args[3] else "group"
  scaffold_analysis(project, outcome = outcome, group = group)
}
FILE:README.md
GitHub: https://github.com/iremaydas/base-r-skill
A Claude Code skill for base R programming.
I'm a political science PhD candidate who uses R regularly but would never call myself an R person. I needed a Claude Code skill for base R -- something without tidyverse, without ggplot2, just plain R -- and I couldn't find one anywhere.
So I made one myself. At 11pm. Asking Claude to help me build a skill for Claude.
If you're also someone who Googles how to drop NA rows in R every single time, this one's for you. 🫶
base-r/
├── SKILL.md                  # Main skill file
├── references/               # Gotchas & non-obvious behaviors
│   ├── data-wrangling.md     # Subsetting traps, apply family, merge, factor quirks
│   ├── modeling.md           # Formula syntax, lm/glm/aov/nls, optim
│   ├── statistics.md         # Hypothesis tests, distributions, clustering
│   ├── visualization.md      # par, layout, devices, colors
│   ├── io-and-text.md        # read.table, grep, regex, format
│   ├── dates-and-system.md   # Date/POSIXct traps, options(), file ops
│   └── misc-utilities.md     # tryCatch, do.call, time series, utilities
├── scripts/
│   ├── check_data.R          # Quick data quality report for any data frame
│   └── scaffold_analysis.R   # Generates a starter analysis script
└── assets/
    └── analysis_template.R   # Copy-paste analysis template
The reference files were condensed from the official R 4.5.3 manual: 19,518 lines → 945 lines (95% reduction). Only the non-obvious stuff survived: gotchas, surprising defaults, tricky interactions. The things Claude already knows well got cut.
Add this skill to your Claude Code setup by pointing to this repo. Then Claude will automatically load the relevant reference files when you're working on R tasks.
Works best for:

- Statistical modeling: `lm`, `glm`, `aov`
- Base graphics: `plot`, `par`, `barplot`

Not for: tidyverse, ggplot2, Shiny, or R package development.
## check_data.R Script

Probably the most useful standalone thing here. Source it and run `check_data(df)` on any data frame to get a formatted report of dimensions, NA counts, numeric summaries, and categorical breakdowns.
source("scripts/check_data.R")
check_data(your_df)
If you spot a missing gotcha, a wrong default, or something that should be in the references β PRs are very welcome. I'm learning too.
Made by @iremaydas β PhD candidate, occasional R user, full-time Googler of things I should probably know by now.