Thursday, June 13, 2024

 

Scheduler tables in OBIEE / OAS



Oracle BI Scheduler is a server that manages and schedules jobs. When a user creates and schedules an agent, Oracle BI Presentation Services gathers information about the agent such as its priority, the intended recipients, and the devices to which content should be delivered. Presentation Services packages this information and other characteristics into a job, then informs Oracle BI Scheduler when to execute the job.

Agents can run in parallel on different threads. The number of agents that can run in parallel depends on the size of the Scheduler thread pool (a configurable setting) and the number of threads used up by each agent. Queueing might occur if too many agents are triggered at the same time.

Oracle BI Scheduler uses a single back-end database to store pertinent information about a job, its instances, and its parameters.

Details about the Scheduler tables are given below:

S_NQ_JOB - This table is used by the Scheduler to store information about scheduled jobs. When you create a new agent in OBIEE, a corresponding entry is created in the S_NQ_JOB table. The table resides in the BI_PLATFORM schema, so you need access to that schema in order to query it.

Some of the important columns in this table are:

JOB_ID - Unique identifier for each agent
NAME - Name of the agent
NEXT_RUN_TIME_TS - Next scheduled run time of the agent
LAST_RUN_TIME_TS - Last run time of the agent

S_NQ_INSTANCE - This table stores information about scheduled job instances. For each job in the S_NQ_JOB table there can be multiple entries in S_NQ_INSTANCE, one per agent run.

JOB_ID - Identifier for the job, populated from S_NQ_JOB
INSTANCE_ID - Unique identifier for each instance
STATUS - Status of the agent run:
    0 - Completed
    1 - Running
    2 - Failed
    3 - Cancelled
    5 - Timed out
BEGIN_TS - Start time of the instance
END_TS - End time of the instance
EXIT_CODE - Number of e-mails sent by the agent job after the job completes

Relation with  S_NQ_JOB
S_NQ_JOB.JOB_ID = S_NQ_INSTANCE.JOB_ID

S_NQ_ERR_MSG - This table stores error messages for Scheduler job instances that do not complete successfully. 

JOB_ID - Same as in S_NQ_JOB and S_NQ_INSTANCE
INSTANCE_ID - Same as in S_NQ_INSTANCE
ERROR_MSG_TXT - Error message for the agent failure; this is the same message that appears in the Agent.log file

Relation with  S_NQ_INSTANCE
S_NQ_INSTANCE.JOB_ID  =  S_NQ_ERR_MSG.JOB_ID
AND  S_NQ_INSTANCE.INSTANCE_ID  =  S_NQ_ERR_MSG.INSTANCE_ID
 
S_NQ_JOB_PARAM - This table holds information about Scheduler job parameters for scheduled jobs.
Relation with  S_NQ_JOB
S_NQ_JOB.JOB_ID = S_NQ_JOB_PARAM.JOB_ID
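
As an illustration of how these tables fit together, here is a hedged Python sketch that joins S_NQ_JOB, S_NQ_INSTANCE, and S_NQ_ERR_MSG to list failed agent runs with their error messages. It assumes access to the BI_PLATFORM schema through the python-oracledb driver; the connection details are placeholders.

# Sketch: list failed agent runs (STATUS = 2) with their error messages.
# Connection details are placeholders; requires access to the BI_PLATFORM schema.
import oracledb

oracledb.defaults.fetch_lobs = False  # return any CLOB columns as plain strings

SQL = """
SELECT j.NAME,
       i.JOB_ID,
       i.INSTANCE_ID,
       i.BEGIN_TS,
       i.END_TS,
       e.ERROR_MSG_TXT
FROM   S_NQ_JOB j
JOIN   S_NQ_INSTANCE i
       ON j.JOB_ID = i.JOB_ID
LEFT JOIN S_NQ_ERR_MSG e
       ON i.JOB_ID = e.JOB_ID
      AND i.INSTANCE_ID = e.INSTANCE_ID
WHERE  i.STATUS = 2
ORDER  BY i.BEGIN_TS DESC
"""

with oracledb.connect(user="biplatform", password="***",
                      dsn="dbhost:1521/ORCLPDB") as connection:
    with connection.cursor() as cursor:
        cursor.execute(SQL)
        for name, job_id, instance_id, begin_ts, end_ts, error_msg in cursor:
            print(name, job_id, instance_id, begin_ts, end_ts, error_msg)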

Wednesday, May 1, 2024

ChromaDB and Faiss are libraries tailored for managing and querying extensive vector databases, each with distinct focal points and attributes.


ChromaDB:
Emphasis: ChromaDB is an open-source embedding database focused on storing vector embeddings of documents together with their metadata and retrieving them by semantic similarity. It is commonly used to back retrieval-augmented generation (RAG) and other LLM-powered workflows.
Attributes:
Simple Collection API: ChromaDB exposes a collection-based API for adding documents, embeddings, and metadata, and can generate embeddings automatically through pluggable embedding functions.
Metadata Filtering: Queries can combine vector similarity with filters on document metadata, which is useful when only a subset of the stored corpus is relevant.
Applications: ChromaDB is prevalent in semantic search, question answering, and chatbot-memory scenarios that need an embedding store with low operational overhead.
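
As a minimal sketch of that workflow using the chromadb Python client (the collection name, documents, and query text are made-up placeholders, and the exact client API may vary between versions):

# Minimal ChromaDB sketch: store a few documents and query them by similarity.
# Collection name, documents, metadata, and query text are illustrative placeholders.
import chromadb

client = chromadb.Client()                          # in-memory client
collection = client.create_collection(name="articles")

collection.add(
    ids=["doc1", "doc2"],
    documents=["Hive stores table metadata in the metastore.",
               "Faiss performs vector similarity search."],
    metadatas=[{"topic": "hive"}, {"topic": "faiss"}],
)

results = collection.query(query_texts=["Where does Hive keep its metadata?"],
                           n_results=1)
print(results["documents"])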

Faiss:
Emphasis: Faiss, an acronym for Facebook AI Similarity Search, is a versatile library designed for similarity searches across expansive vector databases. It's agnostic to specific data types, rendering it adaptable to diverse applications.
Attributes:
Versatility: Faiss accommodates various indexing methods and similarity metrics, ensuring adaptability to different vector data types.
Efficiency: Highly optimized for rapid processing and minimal memory consumption, Faiss is adept at handling substantial datasets with efficiency.
Integration with Deep Learning: Faiss integrates seamlessly with deep learning frameworks, enabling similarity searches on learned embeddings, enriching applications with enhanced search capabilities.
Applications: Faiss finds extensive use in critical similarity search applications like recommendation systems, natural language processing, and image retrieval, due to its ability to handle diverse vector data types with agility.
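
As a minimal sketch of a typical Faiss workflow (the vector dimension, database vectors, and queries below are random placeholders), the following builds an exact L2 index and retrieves the nearest neighbours of each query vector:

# Minimal Faiss sketch: exact nearest-neighbour search with a flat L2 index.
# The dimension, database vectors, and queries are illustrative placeholders.
import numpy as np
import faiss

d = 64                                            # vector dimension
rng = np.random.default_rng(0)
xb = rng.random((10000, d)).astype('float32')     # database vectors
xq = rng.random((5, d)).astype('float32')         # query vectors

index = faiss.IndexFlatL2(d)                      # exact (brute-force) L2 index
index.add(xb)                                     # add database vectors
distances, indices = index.search(xq, 4)          # 4 nearest neighbours per query
print(indices)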

Ultimately, the choice between ChromaDB and Faiss hinges on the nature of your data and the specific needs of your application. ChromaDB excels as an easy-to-operate embedding store for document retrieval and LLM applications, while Faiss stands out as a versatile and highly optimized engine for general-purpose similarity search across extensive vector datasets.

Tuesday, January 23, 2024

In today's data-hungry world, building efficient pipelines to ingest, process, and deliver insights is vital. Platforms like Azure empower data engineers to craft robust and scalable pipelines like never before. 

This guide dives deep into the essential components and best practices of crafting Azure data pipelines, equipping you with practical tips to unleash the full potential of your data flow.

Understanding Data Pipelines:

A data pipeline is a series of interconnected processes that extract, transform, and load (ETL) data from various sources into a target destination, typically a data warehouse, database, or analytical system. The goal is to ensure data is collected, cleansed, and transformed into a usable format for analysis and decision-making.

Components of a Data Pipeline:

  1. Data Sources: Identify the sources of data, which can range from databases and APIs to logs and external feeds. Azure offers connectors for various sources such as Azure SQL Database, Azure Blob Storage, and more.
  2. Data Transformation: This stage involves cleansing, enriching, and transforming the raw data into a structured format. Azure Data Factory, Azure Databricks, and Azure HDInsight are popular tools for this purpose.
  3. Data Movement: Move data efficiently between different storage solutions and services within Azure using Azure Data Factory (for example, its Copy Data activity).
  4. Data Loading: Load the transformed data into the destination, which could be Azure SQL Data Warehouse, Azure Synapse Analytics, or another database.
  5. Orchestration: Tools like Azure Data Factory, Azure Logic Apps, or Apache Airflow can be used to orchestrate the entire pipeline, ensuring the right steps are executed in the correct order.

Best Practices for Azure Data Pipeline Design:

  • Scalability and Elasticity: Leverage Azure's scalability by using services like Azure Databricks or Azure Synapse Analytics to handle varying data workloads.
  • Data Security and Compliance: Implement Azure's security features to protect sensitive data at rest and in transit. Use Azure Key Vault for managing keys and secrets.
  • Modularity: Design pipelines as modular components to facilitate reusability and easier maintenance. This also helps in debugging and troubleshooting.
  • Monitoring and Logging: Implement robust monitoring and logging using Azure Monitor and Azure Log Analytics to track pipeline performance and identify issues.
  • Data Partitioning: When dealing with large datasets, use partitioning strategies to optimize data storage and retrieval efficiency.
  • Backup and Disaster Recovery: Ensure data integrity and availability by implementing backup and disaster recovery solutions provided by Azure.

Building a Customer Analytics Pipeline (Example):

Let's consider an example of building a customer analytics pipeline in Azure:

  • Data Extraction: Extract customer data from Azure SQL Database and external CRM APIs.
  • Data Transformation: Use Azure Databricks to cleanse and transform the data, calculating metrics like customer lifetime value and segmentation (see the sketch after this list).
  • Data Loading: Load the transformed data into Azure Synapse Analytics for further analysis.
  • Orchestration: Use Azure Data Factory to schedule and orchestrate the entire process.
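
To make the transformation step concrete, here is a minimal PySpark sketch of the kind of job that might run in Azure Databricks for this example. The input path, column names, and output table name are hypothetical placeholders, not part of the original scenario.

# Sketch of the Databricks (PySpark) transformation step for the customer analytics example.
# The input path, column names, and output table name are hypothetical placeholders.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("customer-analytics").getOrCreate()

# Read raw order data previously extracted to the data lake (placeholder path).
orders = (spark.read
          .option("header", True)
          .option("inferSchema", True)
          .csv("abfss://raw@mydatalake.dfs.core.windows.net/orders/"))

# Cleanse: drop rows missing key fields and remove duplicate orders.
clean = (orders
         .dropna(subset=["customer_id", "order_id", "order_amount"])
         .dropDuplicates(["order_id"]))

# Transform: a simple customer lifetime value metric per customer.
clv = (clean.groupBy("customer_id")
       .agg(F.sum("order_amount").alias("lifetime_value"),
            F.count("order_id").alias("order_count")))

# Load: write the result as a table for Synapse (or another consumer) to pick up.
clv.write.mode("overwrite").saveAsTable("analytics.customer_lifetime_value")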

Conclusion:

Creating efficient data pipelines in Azure necessitates a profound comprehension of the platform's services and data engineering principles. By adhering to best practices, taking into account scalability, security, and performance, and harnessing the extensive Azure ecosystem, you can develop data pipelines that deliver precise, timely, and actionable insights, propelling your organization toward success. It is crucial to tailor these practices to your unique use case and consistently iterate to enhance the pipeline's efficiency and dependability.

Monday, October 30, 2023

A distributed, fault-tolerant data warehousing system, Apache Hive enables analytics at a massive scale. The Hive Metastore (HMS) is an essential part of many data lake architectures because it provides a central repository of metadata that can be readily analysed to make data-driven decisions. Hive is built on top of Apache Hadoop and supports storage on HDFS as well as S3, ADLS, GS, and other platforms. Hive users can use SQL to read, write, and manage petabytes of data.


Hive Metastore Server (HMS):

  • The Hive Metastore (HMS) is a central repository of metadata for Hive tables and partitions, held in a relational database, which clients (such as Hive, Impala, and Spark) access through the metastore service API.

  • It is now a fundamental component of data lakes that make use of the wide range of open-source tools, including Apache Spark and Presto. 

  • In fact, the Hive Metastore serves as the foundation for an entire ecosystem of tools.

Hive ACID:

Hive provides full ACID support for ORC tables and insert-only ACID support for all other formats.

ACID stands for four traits of database transactions:  
  1. Atomicity (an operation either succeeds completely or fails, it does not leave partial data).
  2. Consistency (once an application performs an operation the results of that operation are visible to it in every subsequent operation).
  3. Isolation (an incomplete operation by one user does not cause unexpected side effects for other users).
  4. Durability (once an operation is complete it will be preserved even in the face of machine or system failure).

These traits have long been expected of database systems as part of their transaction functionality.  

Hive stores the data in HDFS and the schema in an RDBMS (Derby, MySQL, etc.).

  1. When a user creates a table, a schema entry is created in the RDBMS.
  2. When data is loaded, files are created in HDFS. Users can also put files directly into HDFS without interacting with the RDBMS.
  3. Schema-on-read - when the table is read, Hive checks the schema, most importantly the line delimiter and the field delimiter.

Rows and fields are read from the file according to these delimiters, and a table is assembled and returned to the user.

e.g.

Suppose the table definition specifies '\n' (newline) as the line delimiter and ',' (comma) as the field delimiter. Then a file in HDFS might look like this:

1,Employee_Name1,1000

2,Employee_Name2,2000

While reading this file, Hive maps it to a table of 2 rows with 3 columns each.

Interesting part -

  • Even if a file placed directly into HDFS contains something arbitrary, such as the lyrics of a song, Hive will not throw any exception.
  • Hive simply uses the line delimiter to split the data into rows and the field delimiter to split each row into columns (as the sketch below illustrates).
  • If neither delimiter is present in the file, all of the song lyrics end up in the first column of the first row of the table.
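
The schema-on-read behaviour described above can be mimicked with a few lines of plain Python; the file contents and delimiters below mirror the example, and this is only an illustration of the idea, not Hive's actual implementation.

# Illustration of schema-on-read: split a raw file into rows and fields
# using the table's delimiters. This mimics the idea, not Hive's internals.
raw = "1,Employee_Name1,1000\n2,Employee_Name2,2000"

line_delim = "\n"    # line delimiter from the table definition
field_delim = ","    # field delimiter from the table definition

rows = [line.split(field_delim) for line in raw.split(line_delim) if line]
print(rows)   # [['1', 'Employee_Name1', '1000'], ['2', 'Employee_Name2', '2000']]

# A file with no delimiters (e.g. song lyrics) still "parses":
lyrics = "just some song lyrics with no delimiters at all"
print([line.split(field_delim) for line in lyrics.split(line_delim)])
# -> one row with a single column holding the whole text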

Friday, September 1, 2023


The level of interdependence between the variables in your dataset should be identified and quantified. Understanding these interdependencies helps you prepare your data to meet the expectations of machine learning algorithms such as linear regression, whose performance degrades when input variables are highly correlated.

You will learn how to compute correlation for various types of variables and relationships in this tutorial. Correlation is the statistical summary of the relationship between variables.

There are 4 sections in this tutorial; they are as follows:

  1. Correlation: What Is It?
  2. Covariance of the Test Dataset
  3. Correlation by Pearson
  4. Using Spearman's Correlation

Correlation: What Is It?

There are many different ways that variables in a dataset might be related or connected.
For instance:
  • The values of one variable may influence or be dependent upon the values of another.
  • There may be a very slight correlation between two variables.
  • Two variables may both depend on a third, unknown variable.
The ability to better grasp the relationships between variables can be helpful in data analysis and modeling. The term "correlation" refers to the statistical association between two variables.

A correlation can be positive, meaning both variables change in the same direction, or negative, meaning they change in opposite directions. If the correlation is 0, or neutral, the variables are not correlated.
    1. Both variables fluctuate in the same direction when there is a positive correlation.
    2. No correlation exists between the changes in the variables in a neutral correlation.
    3. In a negative correlation, the variables shift against each other.

Impact of Correlations on ML Models:

  • Multicollinearity, or the close relationship between two or more variables, might cause some algorithms to perform worse. For instance, in linear regression, one of the problematic associated variables should be eliminated to raise the model's accuracy.
  • In order to gain insight into which variables may or may not be relevant as input for constructing a model, we may also be interested in the correlation between input variables and the output variable.

Generating Data for the Correlation Analysis:


# generate related variables
from numpy import mean
from numpy import std
from numpy.random import randn
from numpy.random import seed
from matplotlib import pyplot
# seed random number generator
seed(1)
# prepare data
data1 = 20 * randn(1000) + 100
data2 = data1 + (10 * randn(1000) + 50)
# summarize
print('data1: mean=%.3f stdv=%.3f' % (mean(data1), std(data1)))
print('data2: mean=%.3f stdv=%.3f' % (mean(data2), std(data2)))
# plot
pyplot.scatter(data1, data2)
pyplot.show()

data1: mean=100.776 stdv=19.620
data2: mean=151.050 stdv=22.358

Covariance of the Test Dataset:

Variables can be associated through a linear relationship, that is, a relationship that is consistently additive across the two data samples.

This connection can be succinctly described as the covariance between two variables. It is determined by averaging the product of values from each data set, after those values have been adjusted to be centered (by subtracting their mean).

The formula for computing the sample covariance is outlined as follows:

cov(X, Y) = sum((x - mean(X)) * (y - mean(Y))) * 1/(n-1)


Utilizing the mean in the computation implies that each data sample should have a Gaussian or Gaussian-like distribution. The sign of the covariance signifies whether the variables change together (positive) or in opposite directions (negative). The magnitude is hard to interpret on its own. A covariance of zero indicates that the two variables have no linear relationship.

NumPy's cov() function computes a covariance matrix for multiple variables.
covariance = cov(data1, data2)

The matrix's diagonal holds the self-covariance of each variable. The other entries signify the covariance between the paired variables; given that only two variables are under consideration, these remaining entries are identical.

We can derive the covariance matrix for the given pair of variables in our test scenario. Here's the complete illustration:


from numpy.random import randn
from numpy.random import seed
from numpy import cov

# Set random seed
seed(1)

# Prepare data
data1 = 20 * randn(1000) + 100
data2 = data1 + (10 * randn(1000) + 50)

# Calculate covariance matrix
covariance = cov(data1, data2)
print(covariance)

[[385.33297729 389.7545618 ]
 [389.7545618  500.38006058]]

Covariance and covariance matrix play a crucial role in statistics and multivariate analysis for describing relationships among variables.

By executing the example, the covariance matrix is computed and displayed.

Since the dataset involves variables drawn from a Gaussian distribution and the variables are linearly related, covariance is a suitable way to characterize the relationship.

The covariance between the two variables measures 389.75. This positive value indicates that the variables change in the expected direction together.

Using covariance as a standalone statistical tool is problematic due to its complex interpretation, prompting the introduction of Pearson's correlation coefficient.


Pearson’s Correlation:

The Pearson correlation coefficient, named after Karl Pearson, summarizes linear relationship strength between data samples.

It's calculated by dividing the covariance of variables by the product of their standard deviations, normalizing the covariance for an interpretable score.

 

Pearson's correlation coefficient = covariance(X, Y) / (stdv(X) * stdv(Y))

The requirement for mean and standard deviation use implies a Gaussian or Gaussian-like distribution for the data samples.

The outcome of the calculation, the correlation coefficient, provides insights into the relationship.

The coefficient ranges from -1 to 1 and signifies the extent of correlation; 0 implies no correlation. Typically, values below -0.5 or above 0.5 indicate a notable correlation, while values closer to zero suggest a weaker one.

To compute the Pearson’s correlation coefficient for data samples of equal length, one can utilize the pearsonr() function from SciPy.

# calculate the Pearson's correlation between two variables
from numpy.random import randn
from numpy.random import seed
from scipy.stats import pearsonr
# seed random number generator
seed(1)
# prepare data
data1 = 20 * randn(1000) + 100
data2 = data1 + (10 * randn(1000) + 50)
# calculate Pearson's correlation
corr, _ = pearsonr(data1, data2)
print('Pearsons correlation: %.3f' % corr)

Pearsons correlation: 0.888

The variables show a strong positive correlation, with a coefficient of 0.888, close to the maximum possible value of 1.0.

Pearson's correlation coefficient assesses relationships among multiple variables. This involves creating a correlation matrix by calculating interactions between each variable pair. The matrix is symmetrical, with 1.0 on the diagonal due to perfect self-correlation in each column.
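
For instance, a correlation matrix for the two test variables can be produced with NumPy's corrcoef() function, reusing the same data generation as above:

# calculate a correlation matrix with NumPy's corrcoef()
from numpy.random import randn
from numpy.random import seed
from numpy import corrcoef
# seed random number generator
seed(1)
# prepare data
data1 = 20 * randn(1000) + 100
data2 = data1 + (10 * randn(1000) + 50)
# calculate the correlation matrix (symmetric, with 1.0 on the diagonal)
matrix = corrcoef(data1, data2)
print(matrix)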

Spearman’s Correlation

Variables can exhibit varying nonlinear relationships across their distributions. The distribution of these variables may also deviate from the Gaussian pattern.

For such cases, the Spearman's correlation coefficient, named after Charles Spearman, measures the strength of association between data samples. It's applicable for both nonlinear and linear relationships, with slightly reduced sensitivity (lower coefficients) in the latter case.

Similar to the Pearson correlation, scores range between -1 and 1 for perfect negative and positive correlations, respectively. However, Spearman's coefficient utilizes rank-based statistics rather than sample values, making it suitable for non-parametric analysis, where data distribution assumptions like Gaussian aren't made.

Spearman's correlation coefficient
= covariance(rank(X), rank(Y)) / (stdv(rank(X)) * stdv(rank(Y)))

# calculate Spearman's correlation between two variables
from numpy.random import randn
from numpy.random import seed
from scipy.stats import spearmanr
# seed random number generator
seed(1)
# prepare data
data1 = 20 * randn(1000) + 100
data2 = data1 + (10 * randn(1000) + 50)
# calculate spearman's correlation
corr, _ = spearmanr(data1, data2)
print('Spearmans correlation: %.3f' % corr)


Spearmans correlation: 0.872

Even though the data is drawn from a Gaussian distribution and the variables are linearly related, the nonparametric rank-based method still reveals a strong correlation of 0.872 between the variables.

Wednesday, August 30, 2023


Question: 

Lanternfish

You are in the presence of a specific species of lanternfish. Each lanternfish has one special attribute: it creates a new lanternfish once every 7 days.

However, this process isn’t necessarily synchronized between every lanternfish - one lanternfish might have 2 days left until it creates another lanternfish, while another might have 4. So, you can model each fish as a single number that represents the number of days until it creates a new lanternfish.

Furthermore, you reason, a new lanternfish would surely need slightly longer before it’s capable of producing more lanternfish: two more days for its first cycle.

So, suppose you have a lanternfish with an internal timer value of 3:

After one day, its internal timer would become 2.

After another day, its internal timer would become 1.

After another day, its internal timer would become 0.

After another day, its internal timer would reset to 6, and it would create a new lanternfish with an internal timer of 8.

After another day, the first lanternfish would have an internal timer of 5, and the second lanternfish would have an internal timer of 7.

A lanternfish that creates a new fish resets its timer to 6, not 7 (because 0 is included as a valid timer value). The new lanternfish starts with an internal timer of 8 and does not start counting down until the next day.

For example, suppose you were given the following list:

3,4,3,1,2

This list means that the first fish has an internal timer of 3, the second fish has an internal timer of 4, and so on until the fifth fish, which has an internal timer of 2. Simulating these fish over several days would proceed as follows:

Initial state: 3,4,3,1,2

After 1 day: 2,3,2,0,1

After 2 days: 1,2,1,6,0,8

After 3 days: 0,1,0,5,6,7,8

After 4 days: 6,0,6,4,5,6,7,8,8

After 5 days: 5,6,5,3,4,5,6,7,7,8

After 6 days: 4,5,4,2,3,4,5,6,6,7

After 7 days: 3,4,3,1,2,3,4,5,5,6

After 8 days: 2,3,2,0,1,2,3,4,4,5

After 9 days: 1,2,1,6,0,1,2,3,3,4,8

After 10 days: 0,1,0,5,6,0,1,2,2,3,7,8

After 11 days: 6,0,6,4,5,6,0,1,1,2,6,7,8,8,8

After 12 days: 5,6,5,3,4,5,6,0,0,1,5,6,7,7,7,8,8

After 13 days: 4,5,4,2,3,4,5,6,6,0,4,5,6,6,6,7,7,8,8

After 14 days: 3,4,3,1,2,3,4,5,5,6,3,4,5,5,5,6,6,7,7,8

After 15 days: 2,3,2,0,1,2,3,4,4,5,2,3,4,4,4,5,5,6,6,7

After 16 days: 1,2,1,6,0,1,2,3,3,4,1,2,3,3,3,4,4,5,5,6,8

After 17 days: 0,1,0,5,6,0,1,2,2,3,0,1,2,2,2,3,3,4,4,5,7,8

After 18 days: 6,0,6,4,5,6,0,1,1,2,6,0,1,1,1,2,2,3,3,4,6,7,8,8,8,8

Each day, a 0 becomes a 6 and adds a new 8 to the end of the list, while each other number decreases by 1 if it was present at the start of the day.

In this example, after 18 days, there are a total of 26 fish.

Question 1 (easy): How many lanternfish would there be after 80 days?

Question 2 (harder): How many lanternfish would there be after 400 days?
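
Solution approach (a sketch, not an official answer): simulating every fish individually becomes infeasible as the population grows exponentially, so a common trick is to keep only a count of fish per timer value. Each day, the count at timer 0 moves to timer 6 and spawns the same number of new fish at timer 8, while every other count shifts down by one.

# Sketch: count fish per timer value instead of simulating each fish.
# Works for both 80 and 400 days; the initial state is the example list.
from collections import Counter

def lanternfish(initial, days):
    counts = Counter(initial)                 # timer value -> number of fish
    for _ in range(days):
        spawning = counts.pop(0, 0)           # fish whose timer hit 0 today
        counts = Counter({t - 1: n for t, n in counts.items()})  # everyone else ticks down
        counts[6] += spawning                 # parents reset to 6
        counts[8] += spawning                 # newborns start at 8
    return sum(counts.values())

initial = [3, 4, 3, 1, 2]
print(lanternfish(initial, 18))   # 26, matching the worked example above
print(lanternfish(initial, 80))   # answer to question 1
print(lanternfish(initial, 400))  # answer to question 2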
